models.phrases – Phrase (collocation) detection

Automatically detect common phrases (multiword expressions) from a stream of sentences.

The phrases are collocations (frequently co-occurring tokens). See [1] for the exact formula.
For example, if your input stream (=an iterable, with each value a list of token strings) looks like:
>>> print(list(sentence_stream))
[[u'the', u'mayor', u'of', u'new', u'york', u'was', u'there'],
[u'machine', u'learning', u'can', u'be', u'useful', u'sometimes'],
...,
]
you’d train the detector with:
>>> bigram = Phrases(sentence_stream)
and then transform any sentence (list of token strings) using the standard gensim syntax:
>>> sent = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']
>>> print(bigram[sent])
[u'the', u'mayor', u'of', u'new_york', u'was', u'there']
(note new_york became a single token). As usual, you can also transform an entire sentence stream using:
>>> print(list(bigram[any_sentence_stream]))
[[u'the', u'mayor', u'of', u'new_york', u'was', u'there'],
[u'machine_learning', u'can', u'be', u'useful', u'sometimes'],
...,
]
You can also continue updating the collocation counts with new sentences, by:
>>> bigram.add_vocab(new_sentence_stream)
These phrase streams are meant to be used during text preprocessing, before converting the resulting tokens into vectors using Dictionary. See the gensim.models.word2vec module for an example application of using phrase detection.
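For instance, a minimal sketch of such a preprocessing pipeline might look like the following (the variable names are illustrative, and sentence_stream is assumed to be re-iterable):

>>> from gensim.models import Phrases
>>> from gensim.corpora import Dictionary
>>> bigram = Phrases(sentence_stream)                        # learn collocations
>>> phrased = [bigram[sent] for sent in sentence_stream]     # merge detected phrases into single tokens
>>> dictionary = Dictionary(phrased)                         # build the id mapping from phrased tokens
>>> corpus = [dictionary.doc2bow(sent) for sent in phrased]  # bag-of-words vectors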
The detection can also be run repeatedly, to get phrases longer than two tokens (e.g. new_york_times):
>>> trigram = Phrases(bigram[sentence_stream])
>>> sent = [u'the', u'new', u'york', u'times', u'is', u'a', u'newspaper']
>>> print(trigram[bigram[sent]])
[u'the', u'new_york_times', u'is', u'a', u'newspaper']
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
class gensim.models.phrases.Phrases(sentences=None, min_count=5, threshold=10.0, max_vocab_size=40000000, delimiter=b'_')

Bases: gensim.interfaces.TransformationABC
Detect phrases, based on collected collocation counts. Adjacent words that appear together more frequently than expected are joined together with the _ character.
It can be used to generate phrases on the fly, using the phrases[sentence] and phrases[corpus] syntax.
Initialize the model from an iterable of sentences. Each sentence must be a list of words (unicode strings) that will be used for training.
The sentences iterable can be simply a list, but for larger corpora,
consider a generator that streams the sentences directly from disk/network,
without storing everything in RAM. See BrownCorpus, Text8Corpus or LineSentence in the gensim.models.word2vec module for such examples.
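As a hedged sketch of such a streaming iterable (the class name, file name and whitespace tokenization are all illustrative assumptions, not part of gensim):

>>> class SentenceStream(object):
...     def __init__(self, fname):
...         self.fname = fname
...     def __iter__(self):
...         # re-iterable: yields one whitespace-tokenized sentence per line,
...         # reading lazily from disk instead of holding the corpus in RAM
...         with open(self.fname) as fin:
...             for line in fin:
...                 yield line.split()
>>> bigram = Phrases(SentenceStream('corpus.txt'))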
min_count – ignore all words and bigrams with total collected count lower than this.

threshold – represents a threshold for forming the phrases (higher means fewer phrases). A phrase of words a and b is accepted if (cnt(a, b) - min_count) * N / (cnt(a) * cnt(b)) > threshold, where N is the total vocabulary size; see the worked example after these parameter descriptions.

max_vocab_size – the maximum size of the vocabulary. Used to control pruning of less common words, to keep memory under control. The default of 40M needs about 3.6GB of RAM; increase/decrease max_vocab_size depending on how much memory you have available.

delimiter – the glue character used to join collocation tokens; should be a byte string (e.g. b'_').
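As promised above, here is a worked numeric example of the scoring formula; all counts are made up purely for illustration:

>>> cnt_a, cnt_b, cnt_ab = 500, 300, 80  # hypothetical counts for 'new', 'york' and the bigram 'new york'
>>> min_count, N = 5, 40000              # hypothetical vocabulary size
>>> score = float(cnt_ab - min_count) * N / (cnt_a * cnt_b)
>>> print(score)  # 20.0 > default threshold of 10.0, so 'new_york' would be accepted
20.0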
add_vocab(sentences)

Merge the collected counts vocab into this phrase detector.
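A short sketch of incremental training (the stream names are illustrative):

>>> bigram = Phrases(first_sentence_stream)   # initial counts
>>> bigram.add_vocab(second_sentence_stream)  # new counts are merged into the existing vocab, not replaced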
export_phrases(sentences)

Generate an iterator over all phrases detected in the given sentences, as (phrase, score) pairs.
Example:
>>> sentences = Text8Corpus(path_to_corpus)
>>> bigram = Phrases(sentences, min_count=5, threshold=100)
>>> for phrase, score in bigram.export_phrases(sentences):
... print(u'{0} {1}'.format(phrase, score))
You can then redirect this output into a TSV file to debug your choice of threshold.
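For example, a hedged sketch that writes the scores to a TSV file, mirroring the format string used above (the file name is illustrative):

>>> with open('phrases.tsv', 'w') as fout:
...     for phrase, score in bigram.export_phrases(sentences):
...         fout.write(u'{0}\t{1}\n'.format(phrase, score))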
learn_vocab(sentences, max_vocab_size, delimiter=b'_')

Collect unigram/bigram counts from the sentences iterable.
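This is the helper the constructor and add_vocab use internally, so calling it directly is rarely needed. If you do, a hedged sketch (assuming, as in older gensim releases, a static method returning the pruning counter and the collected vocabulary):

>>> min_reduce, vocab = Phrases.learn_vocab(sentence_stream, max_vocab_size=40000000)
>>> print(len(vocab))  # number of unigram and bigram keys collected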
load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap must be None. load will raise an IOError if this condition is encountered.
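For example (the file name is illustrative):

>>> bigram.save('phrases.model')                      # persist the trained detector
>>> bigram = Phrases.load('phrases.model', mmap='r')  # mmap any separately stored large arrays read-only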
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).
fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.
If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.
You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.
ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.
pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.
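A short sketch of saving with explicit options (the file name is illustrative):

>>> bigram.save('phrases.model',
...             separately=None,       # auto-detect large arrays and store them in separate files
...             ignore=frozenset([]),  # attribute names to skip; restored as None on load()
...             pickle_protocol=2)     # keeps the pickle importable under both Python 2 and 3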