torchtext.experimental.datasets
The following datasets have been rewritten to be more compatible with torch.utils.data. General use cases are as follows:
# import datasets
from torchtext.experimental.datasets import IMDB
# set up tokenizer (the default one is the basic_english tokenizer)
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer("spacy")
# obtain data and vocab with a custom tokenizer
train_dataset, test_dataset = IMDB(tokenizer=tokenizer)
vocab = train_dataset.get_vocab()
# use the default tokenizer
train_dataset, test_dataset = IMDB()
vocab = train_dataset.get_vocab()
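Since a custom tokenizer is simply any callable that maps a string to a list of tokens, one can be written without spaCy at all. A minimal sketch (the whitespace split below stands in for real tokenization; `simple_tokenizer` is a hypothetical name, not part of torchtext):

```python
# A custom tokenizer is a callable: str -> list of tokens.
# This sketch lowercases and splits on whitespace; an object like this
# can be passed as the `tokenizer` argument in the examples above.
def simple_tokenizer(line):
    return line.lower().split()

tokens = simple_tokenizer("The movie was GREAT")
print(tokens)  # ['the', 'movie', 'was', 'great']
```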
The following datasets are available:
Sentiment Analysis
IMDb
class torchtext.experimental.datasets.IMDB[source]
Defines IMDB datasets.
The labels include:
0 : Negative
1 : Positive
Create sentiment analysis dataset: IMDB
Separately returns the training and test datasets
- Parameters
root – Directory where the datasets are saved. Default: “.data”
ngrams – a contiguous sequence of n items from a string of text. Default: 1
vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.
removed_tokens – removed tokens from output dataset (Default: [])
tokenizer – the tokenizer used to preprocess raw text data. The default one is the basic_english tokenizer in fastText. The spacy tokenizer is supported as well. A custom tokenizer is a callable function with input of a string and output of a token list.
data_select – a string or tuple for the returned datasets (Default: (‘train’, ‘test’)). By default, both datasets (train, test) are generated. Users could also choose one of them, for example just a string ‘train’. If ‘train’ is not in the tuple or string, a vocab object should be provided which will be used to process the test data.
Examples
>>> from torchtext.experimental.datasets import IMDB
>>> from torchtext.data.utils import get_tokenizer
>>> train, test = IMDB(ngrams=3)
>>> tokenizer = get_tokenizer("spacy")
>>> train, test = IMDB(tokenizer=tokenizer)
>>> train, = IMDB(tokenizer=tokenizer, data_select='train')
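The ngrams parameter above augments each example with contiguous n-item sequences. A sketch of how such expansion typically works (an assumption about the behavior, not the library's exact implementation; torchtext provides a similar helper in torchtext.data.utils):

```python
# For ngrams=2, every unigram is kept and every adjacent pair is appended.
def ngrams_iterator(tokens, ngrams):
    out = list(tokens)                       # unigrams first
    for n in range(2, ngrams + 1):           # then bigrams, trigrams, ...
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

print(ngrams_iterator(["the", "movie", "was", "great"], 2))
```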
Language Modeling
Language modeling datasets are subclasses of the LanguageModelingDataset class.
class torchtext.experimental.datasets.LanguageModelingDataset(data, vocab)[source]
Defines a dataset for language modeling. Currently, we only support the following datasets:
WikiText2
WikiText103
PennTreebank
__init__(data, vocab)[source]
Initiate the language modeling dataset.
- Parameters
data – a tensor of tokens. Tokens are ids after numericalizing the string tokens, e.g. torch.tensor([token_id_1, token_id_2, token_id_3, token_id_1]).long()
vocab – Vocabulary object used for dataset.
Examples
>>> import torch
>>> from torchtext.vocab import build_vocab_from_iterator
>>> data = torch.tensor([token_id_1, token_id_2, token_id_3, token_id_1]).long()
>>> vocab = build_vocab_from_iterator([['language', 'modeling']])
>>> dataset = LanguageModelingDataset(data, vocab)
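The data tensor is just the numericalized token stream. A sketch of producing the ids with a toy vocabulary (a plain dict stands in for a real Vocab object here; this is an illustration, not the library's code):

```python
# Toy numericalization: map string tokens to integer ids, which is the
# form LanguageModelingDataset expects for its `data` argument.
tokens = ["language", "modeling", "language"]
stoi = {}                                # string-to-index mapping
for tok in tokens:
    stoi.setdefault(tok, len(stoi))      # assign the next free id
ids = [stoi[tok] for tok in tokens]
print(ids)  # [0, 1, 0]
```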
WikiText-2
class torchtext.experimental.datasets.WikiText2[source]
Defines WikiText2 datasets.
Create language modeling dataset: WikiText2
Separately returns the train/test/valid sets
- Parameters
tokenizer – the tokenizer used to preprocess raw text data. The default one is the basic_english tokenizer in fastText. The spacy tokenizer is supported as well (see example below). A custom tokenizer is a callable function with input of a string and output of a token list.
root – Directory where the datasets are saved. Default: “.data”
vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.
removed_tokens – removed tokens from output dataset (Default: [])
data_select – a string or tuple for the returned datasets (Default: (‘train’, ‘test’, ‘valid’)). By default, all three datasets (train, test, valid) are generated. Users could also choose any one or two of them, for example (‘train’, ‘test’) or just a string ‘train’. If ‘train’ is not in the tuple or string, a vocab object should be provided which will be used to process the valid and/or test data.
Examples
>>> from torchtext.experimental.datasets import WikiText2
>>> from torchtext.data.utils import get_tokenizer
>>> tokenizer = get_tokenizer("spacy")
>>> train_dataset, test_dataset, valid_dataset = WikiText2(tokenizer=tokenizer)
>>> vocab = train_dataset.get_vocab()
>>> valid_dataset, = WikiText2(tokenizer=tokenizer, vocab=vocab, data_select='valid')
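The single-element unpacking in the last line above (valid_dataset, = ...) is worth noting: when data_select names a single split, the constructor still returns a tuple, and the trailing comma unpacks its one element. A sketch of that pattern (make_datasets is a hypothetical stand-in for the dataset constructors, not a torchtext function):

```python
# Returning a tuple keyed on data_select; a 1-item selection yields a
# 1-tuple, which the trailing-comma assignment unpacks to a bare value.
def make_datasets(data_select=("train", "test", "valid")):
    if isinstance(data_select, str):
        data_select = (data_select,)         # normalize a bare string
    return tuple(f"{split}-dataset" for split in data_select)

valid_dataset, = make_datasets(data_select="valid")
print(valid_dataset)  # valid-dataset
```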
WikiText103
class torchtext.experimental.datasets.WikiText103[source]
Defines WikiText103 datasets.
Create language modeling dataset: WikiText103
Separately returns the train/test/valid sets
- Parameters
tokenizer – the tokenizer used to preprocess raw text data. The default one is the basic_english tokenizer in fastText. The spacy tokenizer is supported as well (see example below). A custom tokenizer is a callable function with input of a string and output of a token list.
root – Directory where the datasets are saved. Default: “.data”
vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.
removed_tokens – removed tokens from output dataset (Default: [])
data_select – a string or tuple for the returned datasets (Default: (‘train’, ‘test’, ‘valid’)). By default, all three datasets (train, test, valid) are generated. Users could also choose any one or two of them, for example (‘train’, ‘test’) or just a string ‘train’. If ‘train’ is not in the tuple or string, a vocab object should be provided which will be used to process the valid and/or test data.
Examples
>>> from torchtext.experimental.datasets import WikiText103
>>> from torchtext.data.utils import get_tokenizer
>>> tokenizer = get_tokenizer("spacy")
>>> train_dataset, test_dataset, valid_dataset = WikiText103(tokenizer=tokenizer)
>>> vocab = train_dataset.get_vocab()
>>> valid_dataset, = WikiText103(tokenizer=tokenizer, vocab=vocab, data_select='valid')
PennTreebank
class torchtext.experimental.datasets.PennTreebank[source]
Defines PennTreebank datasets.
Create language modeling dataset: PennTreebank
Separately returns the train/test/valid sets
- Parameters
tokenizer – the tokenizer used to preprocess raw text data. The default one is the basic_english tokenizer in fastText. The spacy tokenizer is supported as well (see example below). A custom tokenizer is a callable function with input of a string and output of a token list.
root – Directory where the datasets are saved. Default: “.data”
vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.
removed_tokens – removed tokens from output dataset (Default: [])
data_select – a string or tuple for the returned datasets (Default: (‘train’, ‘test’, ‘valid’)). By default, all three datasets (train, test, valid) are generated. Users could also choose any one or two of them, for example (‘train’, ‘test’) or just a string ‘train’. If ‘train’ is not in the tuple or string, a vocab object should be provided which will be used to process the valid and/or test data.
Examples
>>> from torchtext.experimental.datasets import PennTreebank
>>> from torchtext.data.utils import get_tokenizer
>>> tokenizer = get_tokenizer("spacy")
>>> train_dataset, test_dataset, valid_dataset = PennTreebank(tokenizer=tokenizer)
>>> vocab = train_dataset.get_vocab()
>>> valid_dataset, = PennTreebank(tokenizer=tokenizer, vocab=vocab, data_select='valid')