torchtext.data.functional¶

generate_sp_model¶

torchtext.data.functional.generate_sp_model(filename, vocab_size=20000, model_type='unigram', model_prefix='m_user')[source]¶

Train a SentencePiece tokenizer.

Parameters

filename – the data file used to train the SentencePiece model.
vocab_size – the size of the vocabulary (Default: 20,000).
model_type – the type of SentencePiece model: one of unigram, bpe, char, or word (Default: 'unigram').
model_prefix – the prefix of the files in which the model and vocab are saved (Default: 'm_user').

Outputs:

The model and vocab are saved in two separate files whose names start with model_prefix (for example, m_user.model and m_user.vocab).

Examples

>>> from torchtext.data.functional import generate_sp_model
>>> generate_sp_model('test.csv', vocab_size=23456, model_prefix='spm_user')
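
Because the output files are named after model_prefix, it can help to compute the expected file names and check for them once training has finished. The following is a minimal sketch, assuming SentencePiece's usual `<prefix>.model` and `<prefix>.vocab` naming; `sp_output_paths` is a hypothetical helper, not part of torchtext.

```python
from pathlib import Path


def sp_output_paths(model_prefix="m_user"):
    """Return the (model, vocab) paths expected for a given prefix.

    Hypothetical helper: assumes the '<prefix>.model' / '<prefix>.vocab'
    naming convention used by SentencePiece.
    """
    return Path(f"{model_prefix}.model"), Path(f"{model_prefix}.vocab")


model_path, vocab_path = sp_output_paths("spm_user")

# Only meaningful after generate_sp_model(...) has been run with this prefix:
# collect the names of any expected files that are not on disk yet.
missing = [p.name for p in (model_path, vocab_path) if not p.exists()]
```

The resulting model path is the kind of value that would be passed to load_sp_model.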

load_sp_model¶

torchtext.data.functional.load_sp_model(spm_path)[source]¶

Load a SentencePiece model from a file.

Parameters: spm_path – the path of the file in which the SentencePiece model is saved.

Outputs:

output: a SentencePiece model.

Examples

>>> from torchtext.data.functional import load_sp_model
>>> sp_model = load_sp_model("m_user.model")

sentencepiece_numericalizer¶

torchtext.data.functional.sentencepiece_numericalizer(sp_model)[source]¶

Build a transform that uses a SentencePiece model to numericalize text sentences into a generator over their ids.

Parameters: sp_model – a SentencePiece model.

Outputs:

output: a function that takes an iterable of text sentences and returns a generator over the corresponding ids, based on the SentencePiece model.

Examples

>>> from torchtext.data.functional import sentencepiece_numericalizer
>>> sp_id_generator = sentencepiece_numericalizer(sp_model)
>>> list_a = ["sentencepiece encode as pieces", "examples to   try!"]
>>> list(sp_id_generator(list_a))
    [[9858, 9249, 1629, 1305, 1809, 53, 842],
     [2347, 13, 9, 150, 37]]
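
The call pattern above (build the transform once, then apply it to any iterable of sentences) can be sketched in plain Python. This is an illustration only, not torchtext's source: `ToyModel` is a hypothetical stand-in for a trained model, and the sketch assumes the model exposes an `EncodeAsIds` method as SentencePiece processors do.

```python
class ToyModel:
    """Hypothetical stand-in for a SentencePiece model (not a real tokenizer)."""

    def __init__(self):
        self._ids = {}

    def EncodeAsIds(self, line):
        # Give each whitespace-separated word the next free integer id.
        return [self._ids.setdefault(word, len(self._ids)) for word in line.split()]


def numericalizer_sketch(sp_model):
    """Return a function mapping an iterable of sentences to a generator of id lists."""

    def _numericalize(txt_iter):
        for line in txt_iter:
            yield sp_model.EncodeAsIds(line)

    return _numericalize


to_ids = numericalizer_sketch(ToyModel())
ids = list(to_ids(["examples to try", "to try again"]))
print(ids)  # [[0, 1, 2], [1, 2, 3]]
```

Because the returned function is a generator function, sentences are encoded lazily, one at a time, as the result is consumed.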

sentencepiece_tokenizer¶

torchtext.data.functional.sentencepiece_tokenizer(sp_model)[source]¶

Build a transform that uses a SentencePiece model to tokenize text sentences into a generator over their tokens.

Parameters: sp_model – a SentencePiece model.

Outputs:

output: a function that takes an iterable of text sentences and returns a generator over the corresponding tokens, based on the SentencePiece model.

Examples

>>> from torchtext.data.functional import sentencepiece_tokenizer
>>> sp_tokens_generator = sentencepiece_tokenizer(sp_model)
>>> list_a = ["sentencepiece encode as pieces", "examples to   try!"]
>>> list(sp_tokens_generator(list_a))
    [['_sentence', 'piece', '_en', 'co', 'de', '_as', '_pieces'],
     ['_example', 's', '_to', '_try', '!']]
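
The leading marker on some tokens indicates where a new word starts, which makes the tokenization reversible. The sketch below joins pieces back into text. It assumes the marker is SentencePiece's meta symbol U+2581, which appears as a plain underscore in the listing above; `pieces_to_text` is a hypothetical helper, not a torchtext function.

```python
def pieces_to_text(pieces, marker="\u2581"):
    """Join subword pieces and turn the word-boundary marker back into spaces."""
    return "".join(pieces).replace(marker, " ").strip()


# The listing above shows the marker as '_', so it is passed explicitly here.
text = pieces_to_text(
    ["_sentence", "piece", "_en", "co", "de", "_as", "_pieces"], marker="_"
)
print(text)  # sentencepiece encode as pieces
```

With `marker="_"`, any literal underscore in the original text would also become a space, so the default U+2581 marker is the safer choice on real model output.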

custom_replace¶

torchtext.data.functional.custom_replace(replace_pattern)[source]¶

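A transform that rewrites text strings by applying a list of (regular expression pattern, replacement) pairs, given as replace_pattern, in order.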

Examples

>>> from torchtext.data.functional import custom_replace
>>> custom_replace_transform = custom_replace([(r'S', 's'), (r'\s+', ' ')])
>>> list_a = ["Sentencepiece encode  aS  pieces", "exampleS to   try!"]
>>> list(custom_replace_transform(list_a))
    ['sentencepiece encode as pieces', 'examples to try!']
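
The behavior shown above can be approximated with the standard `re` module. This is a minimal sketch under the assumption that the pairs are applied in order to each line; `custom_replace_sketch` is a hypothetical name, and the code is not taken from torchtext.

```python
import re


def custom_replace_sketch(replace_pattern):
    """Return a transform that applies (pattern, replacement) pairs in order."""
    # Compile each pattern once, when the transform is built.
    compiled = [(re.compile(pattern), repl) for pattern, repl in replace_pattern]

    def _transform(txt_iter):
        for line in txt_iter:
            for pattern, repl in compiled:
                line = pattern.sub(repl, line)
            yield line

    return _transform


transform = custom_replace_sketch([(r"S", "s"), (r"\s+", " ")])
cleaned = list(transform(["Sentencepiece encode  aS  pieces", "exampleS to   try!"]))
print(cleaned)  # ['sentencepiece encode as pieces', 'examples to try!']
```

Order matters: each pair sees the output of the previous one, so a later pattern can match text produced by an earlier replacement.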

simple_space_split¶

torchtext.data.functional.simple_space_split(iterator)[source]¶

A transform that splits each text string in an iterator into a list of tokens on spaces.

Examples

>>> from torchtext.data.functional import simple_space_split
>>> list_a = ["Sentencepiece encode as pieces", "example to try!"]
>>> list(simple_space_split(list_a))
    [['Sentencepiece', 'encode', 'as', 'pieces'], ['example', 'to', 'try!']]
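
A plain-Python equivalent is short. The sketch below assumes the split behaves like `str.split()` with no arguments, so runs of whitespace collapse and no empty tokens are produced; whether torchtext handles repeated spaces the same way is an assumption here, and `space_split_sketch` is a hypothetical name.

```python
def space_split_sketch(iterator):
    """Yield one list of whitespace-separated tokens per input line."""
    for line in iterator:
        # str.split() with no arguments collapses runs of whitespace.
        yield line.split()


tokens = list(space_split_sketch(["Sentencepiece encode as pieces", "examples to   try!"]))
print(tokens)
# [['Sentencepiece', 'encode', 'as', 'pieces'], ['examples', 'to', 'try!']]
```

Punctuation stays attached to the neighboring word ('try!'), as in the example above.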

numericalize_tokens_from_iterator¶

torchtext.data.functional.numericalize_tokens_from_iterator(vocab, iterator, removed_tokens=None)[source]¶

Yield a list of ids for each list of tokens produced by a token iterator, using a vocab.

Parameters

vocab – the vocabulary that converts each token into an id.
iterator – the iterator that yields lists of tokens.
removed_tokens – tokens to remove from the output dataset (Default: None).

Examples

>>> from torchtext.data.functional import simple_space_split
>>> from torchtext.data.functional import numericalize_tokens_from_iterator
>>> vocab = {'Sentencepiece' : 0, 'encode' : 1, 'as' : 2, 'pieces' : 3}
>>> ids_iter = numericalize_tokens_from_iterator(vocab,
...                               simple_space_split(["Sentencepiece as pieces",
...                                                   "as pieces"]))
>>> for ids in ids_iter:
...     print([num for num in ids])
[0, 2, 3]
[2, 3]
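
The example above does not exercise removed_tokens. The following sketch illustrates one plausible reading of it, in which the listed tokens are dropped before the vocab lookup. That ordering is an assumption, `numericalize_sketch` is a hypothetical name, and the code is not torchtext's implementation.

```python
def numericalize_sketch(vocab, iterator, removed_tokens=None):
    """Yield, for each token list, an iterator over the ids of the kept tokens."""
    removed = set(removed_tokens or ())
    for tokens in iterator:
        # Filter first, so removed tokens do not need to be in the vocab.
        yield (vocab[token] for token in tokens if token not in removed)


vocab = {"Sentencepiece": 0, "encode": 1, "as": 2, "pieces": 3}
token_lists = [["Sentencepiece", "as", "pieces"], ["as", "pieces"]]

kept = [list(ids) for ids in numericalize_sketch(vocab, token_lists, removed_tokens=["as"])]
print(kept)  # [[0, 3], [3]]
```

A token that is neither removed nor present in a plain dict vocab raises KeyError in this sketch, so an unknown-token policy would have to be handled separately.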
