torchtext.vocab

Vocab
class torchtext.vocab.Vocab(counter, max_size=None, min_freq=1, specials=['<unk>', '<pad>'], vectors=None, unk_init=None, vectors_cache=None, specials_first=True)

Defines a vocabulary object that will be used to numericalize a field.
Variables

freqs – A collections.Counter object holding the frequencies of tokens in the data used to build the Vocab.
stoi – A collections.defaultdict instance mapping token strings to numerical identifiers.
itos – A list of token strings indexed by their numerical identifiers.
__init__(counter, max_size=None, min_freq=1, specials=['<unk>', '<pad>'], vectors=None, unk_init=None, vectors_cache=None, specials_first=True)

Create a Vocab object from a collections.Counter.
Parameters

counter – collections.Counter object holding the frequencies of each value found in the data.
max_size – The maximum size of the vocabulary, or None for no maximum. Default: None.
min_freq – The minimum frequency needed to include a token in the vocabulary. Values less than 1 will be set to 1. Default: 1.
specials – The list of special tokens (e.g., padding or eos) that will be prepended to the vocabulary. Default: ['<unk>', '<pad>'].
vectors – One of the available pretrained vectors, a custom set of pretrained vectors (see Vocab.load_vectors), or a list of such vectors. Default: None.
unk_init (callable) – By default, initialize out-of-vocabulary word vectors to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size. Default: torch.Tensor.zero_.
vectors_cache – Directory for cached vectors. Default: '.vector_cache'.
specials_first – Whether to add the special tokens at the beginning of the vocabulary. If False, they are appended at the end instead. Default: True.
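A minimal usage sketch, assuming the constructor above; the token counts here are made up for illustration:

>>> from collections import Counter
>>> from torchtext.vocab import Vocab
>>> # Hypothetical token counts gathered from a corpus
>>> counter = Counter({'hello': 4, 'world': 3, 'rare': 1})
>>> # Keep tokens seen at least twice; '<unk>' and '<pad>' are prepended
>>> # because specials_first=True by default
>>> vocab = Vocab(counter, min_freq=2, specials=['<unk>', '<pad>'])
>>> vocab.itos            # e.g. ['<unk>', '<pad>', 'hello', 'world']
>>> vocab.stoi['hello']   # index of 'hello' in itos
>>> vocab.freqs['rare']   # raw counts are kept even for tokens cut by min_freq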
load_vectors(vectors, **kwargs)

Parameters

vectors – one of, or a list containing instantiations of, the GloVe, CharNGram, or Vectors classes. Alternatively, one of, or a list of, the available pretrained vector names:
charngram.100d
fasttext.en.300d
fasttext.simple.300d
glove.42B.300d
glove.840B.300d
glove.twitter.27B.25d
glove.twitter.27B.50d
glove.twitter.27B.100d
glove.twitter.27B.200d
glove.6B.50d
glove.6B.100d
glove.6B.200d
glove.6B.300d
remaining keyword arguments – Passed to the constructor of the Vectors classes.
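A hedged sketch of attaching pretrained vectors to a Vocab; note that both calls below download the GloVe 6B files into the vectors cache on first use:

>>> from collections import Counter
>>> from torchtext.vocab import Vocab, GloVe
>>> vocab = Vocab(Counter({'hello': 4, 'world': 3}))
>>> # Pass one of the names listed above...
>>> vocab.load_vectors('glove.6B.50d')
>>> # ...or an instantiated Vectors object, which is equivalent but more explicit
>>> vocab.load_vectors(vectors=GloVe(name='6B', dim=50))
>>> vocab.vectors.shape   # (len(vocab), 50) once vectors are set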
set_vectors(stoi, vectors, dim, unk_init=torch.Tensor.zero_)

Set the vectors for the Vocab instance from a collection of Tensors.
Parameters

stoi – A dictionary mapping each token string to the index of its associated vector in the vectors input argument.
vectors – An indexed iterable (or other structure supporting __getitem__) that, given an input index, returns a FloatTensor representing the vector for the token associated with that index. For example, vectors[stoi["string"]] should return the vector for "string".
dim – The dimensionality of the vectors.
unk_init (callable) – By default, initialize out-of-vocabulary word vectors to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size. Default: torch.Tensor.zero_.
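A minimal sketch of set_vectors with a hand-built embedding matrix; the two-token stoi and the random tensor are purely illustrative:

>>> from collections import Counter
>>> import torch
>>> from torchtext.vocab import Vocab
>>> vocab = Vocab(Counter({'hello': 4, 'world': 3}))
>>> # Hypothetical pre-computed embeddings for two of the vocab tokens
>>> custom_stoi = {'hello': 0, 'world': 1}
>>> custom_vectors = torch.randn(2, 8)   # 2 tokens, dim=8
>>> # Each vocab token is looked up in custom_stoi; tokens missing there
>>> # (here '<unk>' and '<pad>') get vectors from unk_init (zeros by default)
>>> vocab.set_vectors(custom_stoi, custom_vectors, dim=8)
>>> vocab.vectors.shape   # torch.Size([4, 8])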
SubwordVocab

class torchtext.vocab.SubwordVocab(counter, max_size=None, specials=['<pad>'], vectors=None, unk_init=torch.Tensor.zero_)

__init__(counter, max_size=None, specials=['<pad>'], vectors=None, unk_init=torch.Tensor.zero_)

Create a revtok subword vocabulary from a collections.Counter.
Parameters

counter – collections.Counter object holding the frequencies of each word found in the data.
max_size – The maximum size of the subword vocabulary, or None for no maximum. Default: None.
specials – The list of special tokens (e.g., padding or eos) that will be prepended to the vocabulary in addition to an <unk> token. Default: ['<pad>'].
vectors – One of the available pretrained vectors, a custom set of pretrained vectors (see Vocab.load_vectors), or a list of such vectors.
unk_init (callable) – By default, initialize out-of-vocabulary word vectors to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size. Default: torch.Tensor.zero_.
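A rough sketch of building a subword vocabulary; this requires the third-party revtok package to be installed, and the tiny counter here is only illustrative:

>>> from collections import Counter
>>> from torchtext.vocab import SubwordVocab
>>> counter = Counter({'hello': 4, 'world': 3})
>>> subword_vocab = SubwordVocab(counter, max_size=100)
>>> len(subword_vocab.itos)   # at most max_size subword units plus the specials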
Vectors

class torchtext.vocab.Vectors(name, cache=None, url=None, unk_init=None, max_vectors=None)

__init__(name, cache=None, url=None, unk_init=None, max_vectors=None)

Parameters
name – name of the file that contains the vectors
cache – directory for cached vectors
url – url for download if vectors not found in cache
unk_init (callable) – By default, initialize out-of-vocabulary word vectors to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size.
max_vectors (int) – this can be used to limit the number of pre-trained vectors loaded. Most pre-trained vector sets are sorted in the descending order of word frequency. Thus, in situations where the entire set doesn’t fit in memory, or is not needed for another reason, passing max_vectors can limit the size of the loaded set.
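A hedged sketch of loading custom vectors from a local text file; the file name is hypothetical, and each line is assumed to contain a token followed by its vector values:

>>> from torchtext.vocab import Vectors
>>> custom_vectors = Vectors(name='my_embeddings.txt',
...                          cache='.vector_cache',
...                          max_vectors=50000)   # keep only the first 50k entries
>>> custom_vectors.dim               # dimensionality inferred from the file
>>> custom_vectors['hello'].shape    # 1-D tensor of size dim (zeros for OOV tokens by default)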
get_vecs_by_tokens(tokens, lower_case_backup=False)

Look up embedding vectors of tokens.
Parameters

tokens – a token or a list of tokens. If tokens is a string, returns a 1-D tensor of shape self.dim; if tokens is a list of strings, returns a 2-D tensor of shape (len(tokens), self.dim).
lower_case_backup – Whether to fall back to a lower-case lookup. If False, each token is looked up in its original case only; if True, the original case is tried first and, if the token is not found among the keys of the stoi property, the lower-cased token is looked up instead. Default: False.
Examples

>>> examples = ['chip', 'baby', 'Beautiful']
>>> vec = torchtext.vocab.GloVe(name='6B', dim=50)
>>> ret = vec.get_vecs_by_tokens(examples, lower_case_backup=True)
Pretrained Word Embeddings

GloVe

class torchtext.vocab.GloVe(name='840B', dim=300, **kwargs)

__init__(name='840B', dim=300, **kwargs)

Parameters

name – name of the file that contains the vectors
cache – directory for cached vectors
url – url for download if vectors not found in cache
unk_init (callable) – By default, initialize out-of-vocabulary word vectors to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size
max_vectors (int) – this can be used to limit the number of pre-trained vectors loaded. Most pre-trained vector sets are sorted in descending order of word frequency. Thus, in situations where the entire set doesn't fit in memory, or is not needed for another reason, passing max_vectors can limit the size of the loaded set.
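A short usage sketch; the '6B' set with dim=50 is the smallest GloVe release, but the first call still downloads several hundred megabytes into the cache directory:

>>> from torchtext.vocab import GloVe
>>> glove = GloVe(name='6B', dim=50, max_vectors=100000)   # keep only the 100k most frequent words
>>> vecs = glove.get_vecs_by_tokens(['chip', 'baby'], lower_case_backup=True)
>>> vecs.shape   # torch.Size([2, 50])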
FastText

class torchtext.vocab.FastText(language='en', **kwargs)

__init__(language='en', **kwargs)

Parameters

name – name of the file that contains the vectors
cache – directory for cached vectors
url – url for download if vectors not found in cache
unk_init (callable) – By default, initialize out-of-vocabulary word vectors to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size
max_vectors (int) – this can be used to limit the number of pre-trained vectors loaded. Most pre-trained vector sets are sorted in descending order of word frequency. Thus, in situations where the entire set doesn't fit in memory, or is not needed for another reason, passing max_vectors can limit the size of the loaded set.
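A similar hedged sketch for FastText; the language code selects the pretrained wiki vectors, and the English set is large, so setting max_vectors is worthwhile:

>>> from torchtext.vocab import FastText
>>> fasttext = FastText(language='en', max_vectors=50000)
>>> vecs = fasttext.get_vecs_by_tokens(['chip', 'baby'])
>>> vecs.shape   # torch.Size([2, 300]) for the wiki word vectors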
CharNGram

class torchtext.vocab.CharNGram(**kwargs)

__init__(**kwargs)

Parameters

name – name of the file that contains the vectors
cache – directory for cached vectors
url – url for download if vectors not found in cache
unk_init (callable) – By default, initialize out-of-vocabulary word vectors to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size
max_vectors (int) – this can be used to limit the number of pre-trained vectors loaded. Most pre-trained vector sets are sorted in descending order of word frequency. Thus, in situations where the entire set doesn't fit in memory, or is not needed for another reason, passing max_vectors can limit the size of the loaded set.
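A hedged sketch for CharNGram, which composes a word vector from its character n-grams, so even unseen words get a non-trivial vector; the 100-dimensional size matches the charngram.100d set:

>>> from torchtext.vocab import CharNGram
>>> charngram = CharNGram()   # downloads the charngram.100d vectors on first use
>>> vecs = charngram.get_vecs_by_tokens(['chip', 'unseenword'])
>>> vecs.size(-1)   # 100: each token gets a 100-dimensional vector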