
torchtext.experimental.datasets

The following datasets have been rewritten and are more compatible with torch.utils.data. General use cases are as follows:

# import datasets
from torchtext.experimental.datasets import IMDB

# set up the tokenizer (the default one is the basic_english tokenizer)
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer("spacy")

# obtain data and vocab with a custom tokenizer
train_dataset, test_dataset = IMDB(tokenizer=tokenizer)
vocab = train_dataset.get_vocab()

# use the default tokenizer
train_dataset, test_dataset = IMDB()
vocab = train_dataset.get_vocab()
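
Each returned dataset is a map-style torch.utils.data.Dataset, so it can be passed straight to a DataLoader. A minimal sketch, assuming each IMDB sample is a (label, token-id tensor) pair; the collate_batch helper is illustrative, not part of torchtext:

# batch IMDB samples with a DataLoader; pad_sequence pads the
# variable-length token tensors so they stack into one batch
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate_batch(batch):
    labels = torch.tensor([label for label, _ in batch])
    texts = pad_sequence([text for _, text in batch], batch_first=True)
    return labels, texts

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True,
                          collate_fn=collate_batch)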

The following datasets are available:

Sentiment Analysis

IMDb

class torchtext.experimental.datasets.IMDB[source]
Defines IMDB datasets.
The labels include:
  • 0 : Negative

  • 1 : Positive

Create sentiment analysis dataset: IMDB

Separately returns the training and test datasets

Parameters
  • root – Directory where the datasets are saved. Default: “.data”

  • ngrams – a contiguous sequence of n items from a string of text. Default: 1

  • vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.

  • removed_tokens – tokens removed from the output dataset (Default: [])

  • tokenizer – the tokenizer used to preprocess the raw text data. The default one is the basic_english tokenizer in fastText. The spacy tokenizer is supported as well. A custom tokenizer is a callable function that takes a string as input and returns a list of tokens.

  • data_select – a string or tuple for the returned datasets (Default: ('train', 'test')). By default, both datasets (train and test) are generated. Users can also choose either of them, for example just the string 'train'. If 'train' is not included, a vocab object must be provided; it will be used to process the test data.

Examples

>>> from torchtext.experimental.datasets import IMDB
>>> from torchtext.data.utils import get_tokenizer
>>> train, test = IMDB(ngrams=3)
>>> tokenizer = get_tokenizer("spacy")
>>> train, test = IMDB(tokenizer=tokenizer)
>>> train, = IMDB(tokenizer=tokenizer, data_select='train')
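
The vocabulary built on the training split can be reused for the test split, and individual samples can be inspected directly. A short sketch, assuming each sample is a (label, token-id tensor) pair and that the vocabulary exposes an itos list:

>>> train, = IMDB(data_select='train')
>>> vocab = train.get_vocab()
>>> test, = IMDB(vocab=vocab, data_select='test')
>>> label, token_ids = train[0]
>>> tokens = [vocab.itos[idx] for idx in token_ids]  # ids back to strings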

Language Modeling

Language modeling datasets are subclasses of the LanguageModelingDataset class.

class torchtext.experimental.datasets.LanguageModelingDataset(data, vocab)[source]

Defines a dataset for language modeling. Currently, we only support the following datasets:

  • WikiText2

  • WikiText103

  • PennTreebank

__init__(data, vocab)[source]

Initialize the language modeling dataset.

Parameters
  • data – a tensor of token ids, i.e. the ids obtained by numericalizing the string tokens, e.g. torch.tensor([token_id_1, token_id_2, token_id_3, token_id_1]).long()

  • vocab – Vocabulary object used for dataset.

Examples

>>> import torch
>>> from torchtext.vocab import build_vocab_from_iterator
>>> from torchtext.experimental.datasets import LanguageModelingDataset
>>> vocab = build_vocab_from_iterator([['language', 'modeling']])
>>> token_id_1, token_id_2, token_id_3 = 0, 1, 2  # ids from numericalization
>>> data = torch.tensor([token_id_1, token_id_2,
                         token_id_3, token_id_1]).long()
>>> dataset = LanguageModelingDataset(data, vocab)
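
The resulting dataset simply indexes into the flat token tensor, one numericalized token per index. A short sketch under that assumption, continuing the example above:

>>> len(dataset)   # number of token ids in the data tensor
4
>>> dataset[0]     # the first numericalized token
tensor(0)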

WikiText-2

class torchtext.experimental.datasets.WikiText2[source]

Defines WikiText2 datasets.

Create language modeling dataset: WikiText2. Separately returns the train/test/valid sets.

Parameters
  • tokenizer – the tokenizer used to preprocess the raw text data. The default one is the basic_english tokenizer in fastText. The spacy tokenizer is supported as well (see example below). A custom tokenizer is a callable function that takes a string as input and returns a list of tokens.

  • root – Directory where the datasets are saved. Default: “.data”

  • vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.

  • removed_tokens – tokens removed from the output dataset (Default: [])

  • data_select – a string or tuple for the returned datasets (Default: ('train', 'test', 'valid')). By default, all three datasets (train, test, valid) are generated. Users can also choose any one or two of them, for example ('train', 'test') or just the string 'train'. If 'train' is not in the tuple or string, a vocab object must be provided; it will be used to process the valid and/or test data.

Examples

>>> from torchtext.experimental.datasets import WikiText2
>>> from torchtext.data.utils import get_tokenizer
>>> tokenizer = get_tokenizer("spacy")
>>> train_dataset, test_dataset, valid_dataset = WikiText2(tokenizer=tokenizer)
>>> vocab = train_dataset.get_vocab()
>>> valid_dataset, = WikiText2(tokenizer=tokenizer, vocab=vocab,
                               data_select='valid')
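
For language model training, the flat token stream is commonly reshaped into (seq_len, batch_size) columns. A minimal sketch, assuming each dataset index yields one token id; batchify is an illustrative helper, not part of torchtext:

import torch

def batchify(dataset, batch_size):
    # gather the token ids into one flat 1-D tensor
    data = torch.stack([dataset[i] for i in range(len(dataset))])
    # drop the ragged tail so the stream splits into equal columns
    n_batch = data.size(0) // batch_size
    data = data[:n_batch * batch_size]
    # reshape to (batch_size, n_batch), then transpose to column-major batches
    return data.view(batch_size, -1).t().contiguous()

train_data = batchify(train_dataset, batch_size=20)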

WikiText103

class torchtext.experimental.datasets.WikiText103[source]

Defines WikiText103 datasets.

Create language modeling dataset: WikiText103. Separately returns the train/test/valid sets.

Parameters
  • tokenizer – the tokenizer used to preprocess the raw text data. The default one is the basic_english tokenizer in fastText. The spacy tokenizer is supported as well (see example below). A custom tokenizer is a callable function that takes a string as input and returns a list of tokens.

  • root – Directory where the datasets are saved. Default: “.data”

  • vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.

  • removed_tokens – tokens removed from the output dataset (Default: [])

  • data_select – a string or tuple for the returned datasets (Default: ('train', 'test', 'valid')). By default, all three datasets (train, test, valid) are generated. Users can also choose any one or two of them, for example ('train', 'test') or just the string 'train'. If 'train' is not in the tuple or string, a vocab object must be provided; it will be used to process the valid and/or test data.

Examples

>>> from torchtext.experimental.datasets import WikiText103
>>> from torchtext.data.utils import get_tokenizer
>>> tokenizer = get_tokenizer("spacy")
>>> train_dataset, test_dataset, valid_dataset = WikiText103(tokenizer=tokenizer)
>>> vocab = train_dataset.get_vocab()
>>> valid_dataset, = WikiText103(tokenizer=tokenizer, vocab=vocab,
                                 data_select='valid')

PennTreebank

class torchtext.experimental.datasets.PennTreebank[source]

Defines PennTreebank datasets.

Create language modeling dataset: PennTreebank. Separately returns the train/test/valid sets.

Parameters
  • tokenizer – the tokenizer used to preprocess the raw text data. The default one is the basic_english tokenizer in fastText. The spacy tokenizer is supported as well (see example below). A custom tokenizer is a callable function that takes a string as input and returns a list of tokens.

  • root – Directory where the datasets are saved. Default: “.data”

  • vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.

  • removed_tokens – tokens removed from the output dataset (Default: [])

  • data_select – a string or tuple for the returned datasets (Default: ('train', 'test', 'valid')). By default, all three datasets (train, test, valid) are generated. Users can also choose any one or two of them, for example ('train', 'test') or just the string 'train'. If 'train' is not in the tuple or string, a vocab object must be provided; it will be used to process the valid and/or test data.

Examples

>>> from torchtext.experimental.datasets import PennTreebank
>>> from torchtext.data.utils import get_tokenizer
>>> tokenizer = get_tokenizer("spacy")
>>> train_dataset, test_dataset, valid_dataset = PennTreebank(tokenizer=tokenizer)
>>> vocab = train_dataset.get_vocab()
>>> valid_dataset, = PennTreebank(tokenizer=tokenizer, vocab=vocab,
                                  data_select='valid')
