Text API¶
Overview¶
The mxnet.contrib.text APIs refer to classes and functions related to text data processing, such as building indices and loading pre-trained embedding vectors for text tokens, and storing them in the mxnet.ndarray.NDArray format.
Warning
This package contains experimental APIs and may change in the near future.
This document lists the text APIs in mxnet:
mxnet.contrib.text.embedding |
Text token embeddings. |
mxnet.contrib.text.vocab |
Text token indexer. |
mxnet.contrib.text.utils |
Provide utilities for text data processing. |
All the code demonstrated in this document assumes that the following modules or packages are imported.
>>> from mxnet import gluon
>>> from mxnet import nd
>>> from mxnet.contrib import text
>>> import collections
Looking up pre-trained word embeddings for indexed words¶
As a common use case, let us look up pre-trained word embedding vectors for indexed words in just a few lines of code.
To begin with, suppose that we have a simple text data set in the string format. We can count word frequency in the data set.
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.utils.count_tokens_from_str(text_data)
The obtained counter has key-value pairs whose keys are words and values are word frequencies.
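For whitespace-delimited text like the string above, the counter can be approximated with the standard library alone (a rough sketch, not the actual count_tokens_from_str implementation):

```python
import collections

# A rough stdlib equivalent of the counter obtained above:
# keys are words, values are word frequencies.
text_data = " hello world \n hello nice world \n hi world \n"
counter = collections.Counter(text_data.split())
print(counter['world'], counter['hello'])  # 3 2
```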
Suppose that we want to build indices for all the keys in counter and load the defined fastText word embedding for all such indexed words. First, we need a Vocabulary object with counter as its argument:
>>> my_vocab = text.vocab.Vocabulary(counter)
We can create a fastText word embedding object by specifying the embedding name fasttext and the pre-trained file wiki.simple.vec. We also specify that the indexed tokens for loading the fastText word embedding come from the defined Vocabulary object my_vocab.
>>> my_embedding = text.embedding.create('fasttext', pretrained_file_name='wiki.simple.vec',
... vocabulary=my_vocab)
Now we are ready to look up the fastText word embedding vectors for indexed words, such as ‘hello’ and ‘world’.
>>> my_embedding.get_vecs_by_tokens(['hello', 'world'])
[[ 3.95669997e-01 2.14540005e-01 -3.53889987e-02 -2.42990002e-01
...
-7.54180014e-01 -3.14429998e-01 2.40180008e-02 -7.61009976e-02]
[ 1.04440004e-01 -1.08580001e-01 2.72119999e-01 1.32990003e-01
...
-3.73499990e-01 5.67310005e-02 5.60180008e-01 2.90190000e-02]]
Using pre-trained word embeddings in gluon¶
To demonstrate how to use pre-trained word embeddings in the gluon package, let us first obtain the indices of the words ‘hello’ and ‘world’.
>>> my_embedding.to_indices(['hello', 'world'])
[2, 1]
We can obtain the vector representations of the words ‘hello’ and ‘world’ by specifying their indices (2 and 1) and setting my_embedding.idx_to_vec as the weight of mxnet.gluon.nn.Embedding.
>>> layer = gluon.nn.Embedding(len(my_embedding), my_embedding.vec_len)
>>> layer.initialize()
>>> layer.weight.set_data(my_embedding.idx_to_vec)
>>> layer(nd.array([2, 1]))
[[ 3.95669997e-01 2.14540005e-01 -3.53889987e-02 -2.42990002e-01
...
-7.54180014e-01 -3.14429998e-01 2.40180008e-02 -7.61009976e-02]
[ 1.04440004e-01 -1.08580001e-01 2.72119999e-01 1.32990003e-01
...
-3.73499990e-01 5.67310005e-02 5.60180008e-01 2.90190000e-02]]
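Conceptually, the Embedding layer above is just a table of row vectors, and a forward pass gathers the rows named by the input indices. A minimal pure-Python sketch of that lookup (the 4-dimensional vectors are made up for illustration, not real fastText values):

```python
# The embedding weight is a table of row vectors; looking up indices
# simply gathers the corresponding rows. Toy 4-dim vectors.
idx_to_vec = [
    [0.0, 0.0, 0.0, 0.0],  # index 0: '<unk>'
    [0.1, 0.2, 0.3, 0.4],  # index 1: 'world'
    [0.5, 0.6, 0.7, 0.8],  # index 2: 'hello'
]

def embed(indices):
    return [idx_to_vec[i] for i in indices]

print(embed([2, 1]))  # rows for 'hello' then 'world'
```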
Vocabulary¶
The vocabulary builds indices for text tokens. Such indexed tokens can be used by token embedding instances. The input counter, whose keys are candidate indices, may be obtained via count_tokens_from_str.
Vocabulary |
Indexing for text tokens. |
Suppose that we have a simple text data set in the string format. We can count word frequency in the data set.
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.utils.count_tokens_from_str(text_data)
The obtained counter has key-value pairs whose keys are words and values are word frequencies.
Suppose that we want to build indices for the 2 most frequent keys in counter with the unknown token representation ‘<unk>’ and a reserved token ‘<pad>’.
>>> my_vocab = text.vocab.Vocabulary(counter, most_freq_count=2, unknown_token='<unk>',
... reserved_tokens=['<pad>'])
We can access properties such as token_to_idx (mapping tokens to indices), idx_to_token (mapping indices to tokens), unknown_token (representation of any unknown token), and reserved_tokens.
>>> my_vocab.token_to_idx
{'<unk>': 0, '<pad>': 1, 'world': 2, 'hello': 3}
>>> my_vocab.idx_to_token
['<unk>', '<pad>', 'world', 'hello']
>>> my_vocab.unknown_token
'<unk>'
>>> my_vocab.reserved_tokens
['<pad>']
>>> len(my_vocab)
4
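The index layout shown above follows a simple rule: index 0 is the unknown token, reserved tokens come next, and the remaining counter keys are indexed by decreasing frequency. A minimal sketch of that ordering (not the actual Vocabulary implementation, which also handles min_freq and frequency ties):

```python
import collections

def build_index(counter, most_freq_count=None, unknown_token='<unk>',
                reserved_tokens=None):
    # Index 0: unknown token; then reserved tokens; then the most
    # frequent counter keys, capped at most_freq_count.
    idx_to_token = [unknown_token] + list(reserved_tokens or [])
    by_freq = sorted(counter.items(), key=lambda kv: kv[1], reverse=True)
    if most_freq_count is not None:
        by_freq = by_freq[:most_freq_count]
    idx_to_token += [token for token, _ in by_freq]
    return {token: idx for idx, token in enumerate(idx_to_token)}

counter = collections.Counter({'world': 3, 'hello': 2, 'nice': 1, 'hi': 1})
print(build_index(counter, most_freq_count=2, reserved_tokens=['<pad>']))
# {'<unk>': 0, '<pad>': 1, 'world': 2, 'hello': 3}
```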
Besides the specified unknown token ‘<unk>’ and the reserved token ‘<pad>’, the 2 most frequent words ‘world’ and ‘hello’ are also indexed.
Text token embedding¶
To load token embeddings from an externally hosted pre-trained token embedding file, such as those of GloVe and FastText, use embedding.create(embedding_name, pretrained_file_name).
To get all the available values of embedding_name and pretrained_file_name, use embedding.get_pretrained_file_names().
>>> text.embedding.get_pretrained_file_names()
{'glove': ['glove.42B.300d.txt', 'glove.6B.50d.txt', 'glove.6B.100d.txt', ...],
'fasttext': ['wiki.en.vec', 'wiki.simple.vec', 'wiki.zh.vec', ...]}
Alternatively, to load embedding vectors from a custom pre-trained text token embedding file, use CustomEmbedding. Moreover, to load composite embedding vectors, such as concatenated embedding vectors, use CompositeEmbedding.
The indexed tokens in a text token embedding may come from a vocabulary or from the loaded embedding vectors. In the former case, only the indexed tokens in the vocabulary are associated with the loaded embedding vectors, such as those loaded from a pre-trained token embedding file. In the latter case, all the tokens from the loaded embedding vectors are taken as the indexed tokens of the embedding.
register |
Registers a new token embedding. |
create |
Creates an instance of token embedding. |
get_pretrained_file_names |
Get valid token embedding names and their pre-trained file names. |
GloVe |
The GloVe word embedding. |
FastText |
The fastText word embedding. |
CustomEmbedding |
User-defined token embedding. |
CompositeEmbedding |
Composite token embeddings. |
Indexed tokens are from a vocabulary¶
One can specify that only the indexed tokens in a vocabulary are associated with the loaded embedding vectors, such as those loaded from a pre-trained token embedding file.
To begin with, suppose that we have a simple text data set in the string format. We can count word frequency in the data set.
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.utils.count_tokens_from_str(text_data)
The obtained counter has key-value pairs whose keys are words and values are word frequencies.
Suppose that we want to build indices for the 2 most frequent keys in counter and load the defined fastText word embedding with the pre-trained file wiki.simple.vec for these 2 words.
>>> my_vocab = text.vocab.Vocabulary(counter, most_freq_count=2)
>>> my_embedding = text.embedding.create('fasttext', pretrained_file_name='wiki.simple.vec',
... vocabulary=my_vocab)
Now we are ready to look up the fastText word embedding vectors for indexed words.
>>> my_embedding.get_vecs_by_tokens(['hello', 'world'])
[[ 3.95669997e-01 2.14540005e-01 -3.53889987e-02 -2.42990002e-01
...
-7.54180014e-01 -3.14429998e-01 2.40180008e-02 -7.61009976e-02]
[ 1.04440004e-01 -1.08580001e-01 2.72119999e-01 1.32990003e-01
...
-3.73499990e-01 5.67310005e-02 5.60180008e-01 2.90190000e-02]]
We can also access properties such as token_to_idx (mapping tokens to indices), idx_to_token (mapping indices to tokens), and vec_len (length of each embedding vector).
>>> my_embedding.token_to_idx
{'<unk>': 0, 'world': 1, 'hello': 2}
>>> my_embedding.idx_to_token
['<unk>', 'world', 'hello']
>>> len(my_embedding)
3
>>> my_embedding.vec_len
300
If a token is unknown to my_embedding, its embedding vector is initialized according to the default specification for unknown tokens (all elements are 0).
>>> my_embedding.get_vecs_by_tokens('nice')
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
...
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
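The zero vector returned for ‘nice’ comes from the unknown-token fallback: any token absent from token_to_idx resolves to index 0, whose vector was filled by init_unknown_vec. A pure-Python sketch of that lookup (toy 4-dimensional vectors, made up for illustration):

```python
# Tokens missing from token_to_idx fall back to index 0, the unknown
# token, whose vector was initialized to zeros. Toy 4-dim vectors.
token_to_idx = {'<unk>': 0, 'world': 1, 'hello': 2}
idx_to_vec = [
    [0.0, 0.0, 0.0, 0.0],  # '<unk>': filled by init_unknown_vec
    [0.1, 0.2, 0.3, 0.4],  # 'world'
    [0.5, 0.6, 0.7, 0.8],  # 'hello'
]

def get_vec(token):
    return idx_to_vec[token_to_idx.get(token, 0)]

print(get_vec('nice'))  # unknown token -> zero vector
```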
Indexed tokens are from the loaded embedding vectors¶
One can also use all the tokens from the loaded embedding vectors, such as those loaded from a pre-trained token embedding file, as the indexed tokens of the embedding.
To begin with, we can create a fastText word embedding object by specifying the embedding name ‘fasttext’ and the pre-trained file ‘wiki.simple.vec’. The argument init_unknown_vec specifies the default vector representation for any unknown token. To index all the tokens from this pre-trained word embedding file, we do not need to specify any vocabulary.
>>> my_embedding = text.embedding.create('fasttext', pretrained_file_name='wiki.simple.vec',
... init_unknown_vec=nd.zeros)
We can access properties such as token_to_idx (mapping tokens to indices), idx_to_token (mapping indices to tokens), vec_len (length of each embedding vector), and unknown_token (representation of any unknown token, default value is ‘<unk>’).
>>> my_embedding.token_to_idx['nice']
2586
>>> my_embedding.idx_to_token[2586]
'nice'
>>> my_embedding.vec_len
300
>>> my_embedding.unknown_token
'<unk>'
For every unknown token, if its representation ‘<unk>’ is encountered in the pre-trained token embedding file, index 0 of the property idx_to_vec maps to the pre-trained token embedding vector loaded from the file; otherwise, index 0 of the property idx_to_vec maps to the default token embedding vector specified via init_unknown_vec (set to nd.zeros here). Since the pre-trained file does not have a vector for the token ‘<unk>’, index 0 has to map to an additional token ‘<unk>’ and the number of tokens in the embedding is 111,052.
>>> len(my_embedding)
111052
>>> my_embedding.idx_to_vec[0]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
...
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
>>> my_embedding.get_vecs_by_tokens('nice')
[ 0.49397001 0.39996001 0.24000999 -0.15121 -0.087512 0.37114
...
0.089521 0.29175001 -0.40917999 -0.089206 -0.1816 -0.36616999]
>>> my_embedding.get_vecs_by_tokens(['unknownT0kEN', 'unknownT0kEN'])
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
...
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
...
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
Implement a new text token embedding¶
To implement a new text token embedding, create a subclass of mxnet.contrib.text.embedding._TokenEmbedding and add the @mxnet.contrib.text.embedding._TokenEmbedding.register decorator before this class. See embedding.py for examples.
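The register/create pair is a plain registry pattern. The sketch below shows the idea with illustrative names, not the actual mxnet internals:

```python
# Minimal registry sketch: register() records a class under its
# lower-cased name, create() looks it up and instantiates it.
_registry = {}

def register(cls):
    _registry[cls.__name__.lower()] = cls
    return cls

def create(name, **kwargs):
    # Name lookup is case-insensitive, mirroring embedding.create.
    return _registry[name.lower()](**kwargs)

@register
class MyTextEmbed:
    def __init__(self, pretrained_file_name='my_pretrain_file'):
        self.pretrained_file_name = pretrained_file_name

embed = create('MyTextEmbed')
print(type(embed).__name__)  # MyTextEmbed
```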
Text utilities¶
The following functions provide utilities for text data processing.
count_tokens_from_str |
Counts tokens in the specified string. |
API Reference¶
Text token embeddings.
- mxnet.contrib.text.embedding.register(embedding_cls)[source]¶ Registers a new token embedding.
Once an embedding is registered, we can create an instance of this embedding with create().
Examples
>>> @mxnet.contrib.text.embedding.register
... class MyTextEmbed(mxnet.contrib.text.embedding._TokenEmbedding):
...     def __init__(self, pretrained_file_name='my_pretrain_file'):
...         pass
>>> embed = mxnet.contrib.text.embedding.create('MyTextEmbed')
>>> print(type(embed))
- mxnet.contrib.text.embedding.create(embedding_name, **kwargs)[source]¶ Creates an instance of token embedding.
Creates a token embedding instance by loading embedding vectors from an externally hosted pre-trained token embedding file, such as those of GloVe and FastText. To get all the valid embedding_name and pretrained_file_name, use mxnet.contrib.text.embedding.get_pretrained_file_names().
Parameters: embedding_name (str) – The token embedding name (case-insensitive). Returns: A token embedding instance that loads embedding vectors from an externally hosted pre-trained token embedding file. Return type: An instance of mxnet.contrib.text.embedding._TokenEmbedding
- mxnet.contrib.text.embedding.get_pretrained_file_names(embedding_name=None)[source]¶ Get valid token embedding names and their pre-trained file names.
To load token embedding vectors from an externally hosted pre-trained token embedding file, such as those of GloVe and FastText, one should use mxnet.contrib.text.embedding.create(embedding_name, pretrained_file_name). This method returns all the valid names of pretrained_file_name for the specified embedding_name. If embedding_name is set to None, this method returns all the valid names of embedding_name with their associated pretrained_file_name.
Parameters: embedding_name (str or None, default None) – The pre-trained token embedding name. Returns: A list of all the valid pre-trained token embedding file names (pretrained_file_name) for the specified token embedding name (embedding_name). If the text embeding name is set to None, returns a dict mapping each valid token embedding name to a list of valid pre-trained files (pretrained_file_name). They can be plugged into mxnet.contrib.text.embedding.create(embedding_name, pretrained_file_name). Return type: dict or list
- class mxnet.contrib.text.embedding.GloVe(pretrained_file_name='glove.840B.300d.txt', embedding_root='$MXNET_HOME/embeddings', init_unknown_vec=nd.zeros, vocabulary=None, **kwargs)[source]¶ The GloVe word embedding.
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. (Source from https://nlp.stanford.edu/projects/glove/)
References
GloVe: Global Vectors for Word Representation. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. https://nlp.stanford.edu/pubs/glove.pdf
Website:
https://nlp.stanford.edu/projects/glove/
To get the updated URLs to the externally hosted pre-trained token embedding files, visit https://nlp.stanford.edu/projects/glove/
License for pre-trained embeddings:
Parameters: - pretrained_file_name (str, default 'glove.840B.300d.txt') – The name of the pre-trained token embedding file.
- embedding_root (str, default $MXNET_HOME/embeddings) – The root directory for storing embedding-related files.
- init_unknown_vec (callback) – The callback used to initialize the embedding vector for the unknown token.
- vocabulary (Vocabulary, default None) – It contains the tokens to index. Each indexed token will be associated with the loaded embedding vectors, such as those loaded from a pre-trained token embedding file. If None, all the tokens from the loaded embedding vectors will be indexed.
- get_vecs_by_tokens(tokens, lower_case_backup=False)¶ Look up embedding vectors of tokens.
Parameters: - tokens (str or list of strs) – A token or a list of tokens.
- lower_case_backup (bool, default False) – If False, each token in the original case will be looked up; if True, each token in the original case will be looked up first, if not found in the keys of the property token_to_idx, the token in the lower case will be looked up.
Returns: The embedding vector(s) of the token(s). According to numpy conventions, if tokens is a string, returns a 1-D NDArray of shape self.vec_len; if tokens is a list of strings, returns a 2-D NDArray of shape (len(tokens), self.vec_len).
Return type: mxnet.ndarray.NDArray
- to_indices(tokens)¶ Converts tokens to indices according to the vocabulary.
Parameters: tokens (str or list of strs) – A source token or tokens to be converted. Returns: A token index or a list of token indices according to the vocabulary. Return type: int or list of ints
- to_tokens(indices)¶ Converts token indices to tokens according to the vocabulary.
Parameters: indices (int or list of ints) – A source token index or token indices to be converted. Returns: A token or a list of tokens according to the vocabulary. Return type: str or list of strs
- update_token_vectors(tokens, new_vectors)¶ Updates embedding vectors for tokens.
Parameters: - tokens (str or a list of strs) – A token or a list of tokens whose embedding vector are to be updated.
- new_vectors (mxnet.ndarray.NDArray) – An NDArray to be assigned to the embedding vectors of tokens. Its length must be equal to the number of tokens and its width must be equal to the dimension of embeddings of the glossary. If tokens is a singleton, it must be 1-D or 2-D. If tokens is a list of multiple strings, it must be 2-D.
- class mxnet.contrib.text.embedding.FastText(pretrained_file_name='wiki.simple.vec', embedding_root='$MXNET_HOME/embeddings', init_unknown_vec=nd.zeros, vocabulary=None, **kwargs)[source]¶ The fastText word embedding.
FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices. (Source from https://fasttext.cc/)
References
Enriching Word Vectors with Subword Information. Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. https://arxiv.org/abs/1607.04606
Bag of Tricks for Efficient Text Classification. Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. https://arxiv.org/abs/1607.01759
FastText.zip: Compressing text classification models. Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Herve Jegou, and Tomas Mikolov. https://arxiv.org/abs/1612.03651
For ‘wiki.multi’ embeddings: Word Translation Without Parallel Data Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Herve Jegou. https://arxiv.org/abs/1710.04087
Website:
To get the updated URLs to the externally hosted pre-trained token embedding files, visit https://github.com/facebookresearch/fastText/blob/master/docs/pretrained-vectors.md
License for pre-trained embeddings:
Parameters: - pretrained_file_name (str, default 'wiki.simple.vec') – The name of the pre-trained token embedding file.
- embedding_root (str, default $MXNET_HOME/embeddings) – The root directory for storing embedding-related files.
- init_unknown_vec (callback) – The callback used to initialize the embedding vector for the unknown token.
- vocabulary (Vocabulary, default None) – It contains the tokens to index. Each indexed token will be associated with the loaded embedding vectors, such as those loaded from a pre-trained token embedding file. If None, all the tokens from the loaded embedding vectors will be indexed.
- get_vecs_by_tokens(tokens, lower_case_backup=False)¶ Look up embedding vectors of tokens.
Parameters: - tokens (str or list of strs) – A token or a list of tokens.
- lower_case_backup (bool, default False) – If False, each token in the original case will be looked up; if True, each token in the original case will be looked up first, if not found in the keys of the property token_to_idx, the token in the lower case will be looked up.
Returns: The embedding vector(s) of the token(s). According to numpy conventions, if tokens is a string, returns a 1-D NDArray of shape self.vec_len; if tokens is a list of strings, returns a 2-D NDArray of shape (len(tokens), self.vec_len).
Return type: mxnet.ndarray.NDArray
- to_indices(tokens)¶ Converts tokens to indices according to the vocabulary.
Parameters: tokens (str or list of strs) – A source token or tokens to be converted. Returns: A token index or a list of token indices according to the vocabulary. Return type: int or list of ints
- to_tokens(indices)¶ Converts token indices to tokens according to the vocabulary.
Parameters: indices (int or list of ints) – A source token index or token indices to be converted. Returns: A token or a list of tokens according to the vocabulary. Return type: str or list of strs
- update_token_vectors(tokens, new_vectors)¶ Updates embedding vectors for tokens.
Parameters: - tokens (str or a list of strs) – A token or a list of tokens whose embedding vector are to be updated.
- new_vectors (mxnet.ndarray.NDArray) – An NDArray to be assigned to the embedding vectors of tokens. Its length must be equal to the number of tokens and its width must be equal to the dimension of embeddings of the glossary. If tokens is a singleton, it must be 1-D or 2-D. If tokens is a list of multiple strings, it must be 2-D.
- class mxnet.contrib.text.embedding.CustomEmbedding(pretrained_file_path, elem_delim=' ', encoding='utf8', init_unknown_vec=nd.zeros, vocabulary=None, **kwargs)[source]¶ User-defined token embedding.
This is to load embedding vectors from a user-defined pre-trained text embedding file.
Denote by ‘[ed]’ the argument elem_delim and by [v_ij] the j-th element of the token embedding vector for [token_i]. The expected format of a custom pre-trained token embedding file is:
‘[token_1][ed][v_11][ed][v_12][ed]...[ed][v_1k]\n[token_2][ed][v_21][ed][v_22][ed]...[ed][v_2k]\n...’
where k is the length of the embedding vector vec_len.
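For instance, a two-token file with elem_delim=' ' and 3-dimensional vectors can be parsed as below. This is a sketch of the file format only, not the CustomEmbedding implementation:

```python
import io

# Each line: a token, then its vector elements, joined by elem_delim.
elem_delim = ' '
pretrained_file = io.StringIO(
    'hello 0.5 0.6 0.7\n'
    'world 0.1 0.2 0.3\n'
)

token_to_vec = {}
for line in pretrained_file:
    parts = line.rstrip('\n').split(elem_delim)
    token_to_vec[parts[0]] = [float(x) for x in parts[1:]]

print(token_to_vec['world'])  # [0.1, 0.2, 0.3]
```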
Parameters: - pretrained_file_path (str) – The path to the custom pre-trained token embedding file.
- elem_delim (str, default ' ') – The delimiter for splitting a token and every embedding vector element value on the same line of the custom pre-trained token embedding file.
- encoding (str, default 'utf8') – The encoding scheme for reading the custom pre-trained token embedding file.
- init_unknown_vec (callback) – The callback used to initialize the embedding vector for the unknown token.
- vocabulary (Vocabulary, default None) – It contains the tokens to index. Each indexed token will be associated with the loaded embedding vectors, such as those loaded from a pre-trained token embedding file. If None, all the tokens from the loaded embedding vectors will be indexed.
- get_vecs_by_tokens(tokens, lower_case_backup=False)¶ Look up embedding vectors of tokens.
Parameters: - tokens (str or list of strs) – A token or a list of tokens.
- lower_case_backup (bool, default False) – If False, each token in the original case will be looked up; if True, each token in the original case will be looked up first, if not found in the keys of the property token_to_idx, the token in the lower case will be looked up.
Returns: The embedding vector(s) of the token(s). According to numpy conventions, if tokens is a string, returns a 1-D NDArray of shape self.vec_len; if tokens is a list of strings, returns a 2-D NDArray of shape (len(tokens), self.vec_len).
Return type: mxnet.ndarray.NDArray
- to_indices(tokens)¶ Converts tokens to indices according to the vocabulary.
Parameters: tokens (str or list of strs) – A source token or tokens to be converted. Returns: A token index or a list of token indices according to the vocabulary. Return type: int or list of ints
- to_tokens(indices)¶ Converts token indices to tokens according to the vocabulary.
Parameters: indices (int or list of ints) – A source token index or token indices to be converted. Returns: A token or a list of tokens according to the vocabulary. Return type: str or list of strs
- update_token_vectors(tokens, new_vectors)¶ Updates embedding vectors for tokens.
Parameters: - tokens (str or a list of strs) – A token or a list of tokens whose embedding vector are to be updated.
- new_vectors (mxnet.ndarray.NDArray) – An NDArray to be assigned to the embedding vectors of tokens. Its length must be equal to the number of tokens and its width must be equal to the dimension of embeddings of the glossary. If tokens is a singleton, it must be 1-D or 2-D. If tokens is a list of multiple strings, it must be 2-D.
- class mxnet.contrib.text.embedding.CompositeEmbedding(vocabulary, token_embeddings)[source]¶ Composite token embeddings.
For each indexed token in a vocabulary, multiple embedding vectors, such as concatenated multiple embedding vectors, will be associated with it. Such embedding vectors can be loaded from externally hosted or custom pre-trained token embedding files, such as via token embedding instances.
Parameters: - vocabulary (Vocabulary) – For each indexed token in this vocabulary, multiple embedding vectors, such as concatenated multiple embedding vectors, will be associated with it.
- token_embeddings (instance or list of mxnet.contrib.text.embedding._TokenEmbedding) – One or multiple pre-trained token embeddings to load. If it is a list of multiple embeddings, these embedding vectors will be concatenated for each token.
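Conceptually, the composite lookup concatenates the per-embedding vectors token by token. A sketch with made-up 2-dimensional vectors standing in for two constituent embeddings:

```python
# Two toy constituent embeddings (made-up 2-dim vectors standing in
# for, e.g., GloVe and fastText lookups).
glove_like = {'hello': [0.1, 0.2], 'world': [0.3, 0.4]}
fasttext_like = {'hello': [0.5, 0.6], 'world': [0.7, 0.8]}

def composite_vec(token, embeddings):
    # Concatenate the token's vector from each constituent embedding.
    vec = []
    for emb in embeddings:
        vec.extend(emb[token])
    return vec

print(composite_vec('hello', [glove_like, fasttext_like]))
# [0.1, 0.2, 0.5, 0.6]
```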
- get_vecs_by_tokens(tokens, lower_case_backup=False)¶ Look up embedding vectors of tokens.
Parameters: - tokens (str or list of strs) – A token or a list of tokens.
- lower_case_backup (bool, default False) – If False, each token in the original case will be looked up; if True, each token in the original case will be looked up first, if not found in the keys of the property token_to_idx, the token in the lower case will be looked up.
Returns: The embedding vector(s) of the token(s). According to numpy conventions, if tokens is a string, returns a 1-D NDArray of shape self.vec_len; if tokens is a list of strings, returns a 2-D NDArray of shape (len(tokens), self.vec_len).
Return type: mxnet.ndarray.NDArray
- to_indices(tokens)¶ Converts tokens to indices according to the vocabulary.
Parameters: tokens (str or list of strs) – A source token or tokens to be converted. Returns: A token index or a list of token indices according to the vocabulary. Return type: int or list of ints
- to_tokens(indices)¶ Converts token indices to tokens according to the vocabulary.
Parameters: indices (int or list of ints) – A source token index or token indices to be converted. Returns: A token or a list of tokens according to the vocabulary. Return type: str or list of strs
- update_token_vectors(tokens, new_vectors)¶ Updates embedding vectors for tokens.
Parameters: - tokens (str or a list of strs) – A token or a list of tokens whose embedding vector are to be updated.
- new_vectors (mxnet.ndarray.NDArray) – An NDArray to be assigned to the embedding vectors of tokens. Its length must be equal to the number of tokens and its width must be equal to the dimension of embeddings of the glossary. If tokens is a singleton, it must be 1-D or 2-D. If tokens is a list of multiple strings, it must be 2-D.
Text token indexer.
- class mxnet.contrib.text.vocab.Vocabulary(counter=None, most_freq_count=None, min_freq=1, unknown_token='<unk>', reserved_tokens=None)[source]¶ Indexing for text tokens.
Build indices for the unknown token, reserved tokens, and input counter keys. Indexed tokens can be used by token embeddings.
Parameters: - counter (collections.Counter or None, default None) – Counts text token frequencies in the text data. Its keys will be indexed according to frequency thresholds such as most_freq_count and min_freq. Keys of counter, unknown_token, and values of reserved_tokens must be of the same hashable type. Examples: str, int, and tuple.
- most_freq_count (None or int, default None) – The maximum possible number of the most frequent tokens in the keys of counter that can be indexed. Note that this argument does not count any token from reserved_tokens. Suppose that there are different keys of counter whose frequency are the same, if indexing all of them will exceed this argument value, such keys will be indexed one by one according to their __cmp__() order until the frequency threshold is met. If this argument is None or larger than its largest possible value restricted by counter and reserved_tokens, this argument has no effect.
- min_freq (int, default 1) – The minimum frequency required for a token in the keys of counter to be indexed.
- unknown_token (hashable object, default '<unk>') – The representation for any unknown token. In other words, any unknown token will be indexed as the same representation. Keys of counter, unknown_token, and values of reserved_tokens must be of the same hashable type. Examples: str, int, and tuple.
- reserved_tokens (list of hashable objects or None, default None) – A list of reserved tokens that will always be indexed, such as special symbols representing padding, beginning of sentence, and end of sentence. It cannot contain unknown_token, or duplicate reserved tokens. Keys of counter, unknown_token, and values of reserved_tokens must be of the same hashable type. Examples: str, int, and tuple.
- unknown_token¶ hashable object – The representation for any unknown token. In other words, any unknown token will be indexed as the same representation.
- reserved_tokens¶ list of strs or None – A list of reserved tokens that will always be indexed.
Provide utilities for text data processing.
- mxnet.contrib.text.utils.count_tokens_from_str(source_str, token_delim=' ', seq_delim='\n', to_lower=False, counter_to_update=None)[source]¶ Counts tokens in the specified string.
For token_delim='<td>' and seq_delim='<sd>', a specified string of two sequences of tokens may look like:
<td>token1<td>token2<td>token3<td><sd><td>token4<td>token5<td><sd>
token_delim and seq_delim are regular expressions. Make use of \ to allow special characters as delimiters. The list of special characters can be found at https://docs.python.org/3/library/re.html.
Parameters: - source_str (str) – A source string of tokens.
- token_delim (str, default ' ') – A token delimiter.
- seq_delim (str, default '\n') – A sequence delimiter.
- to_lower (bool, default False) – Whether to convert source_str to lower case.
- counter_to_update (collections.Counter or None, default None) – The collections.Counter instance to be updated with the token counts of source_str. If None, return a new collections.Counter instance counting tokens from source_str.
Returns: The counter_to_update collections.Counter instance after being updated with the token counts of source_str. If counter_to_update is None, return a new collections.Counter instance counting tokens from source_str.
Return type: collections.Counter
Examples
>>> source_str = ' Life is great ! \n life is good . \n'
>>> count_tokens_from_str(source_str, ' ', '\n', True)
Counter({'!': 1, '.': 1, 'good': 1, 'great': 1, 'is': 2, 'life': 2})
>>> source_str = '*Life*is*great*!*\n*life*is*good*.*\n'
>>> count_tokens_from_str(source_str, '\*', '\n', True)
Counter({'is': 2, 'life': 2, '!': 1, 'great': 1, 'good': 1, '.': 1})
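Because the delimiters are regular expressions, the second example must escape ‘*’ as ‘\*’. The splitting behavior can be reproduced with the standard re module (a sketch, not the actual implementation):

```python
import collections
import re

# token_delim and seq_delim are regexes, so a literal '*' delimiter
# must be escaped as '\*'. Empty strings from the split are dropped.
source_str = '*Life*is*great*!*\n*life*is*good*.*\n'
tokens = re.split(r'\*|\n', source_str.lower())
counter = collections.Counter(t for t in tokens if t)
print(counter['life'], counter['is'])  # 2 2
```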