corpora.bleicorpus – Corpus in Blei’s LDA-C format
Blei’s LDA-C format.
gensim.corpora.bleicorpus.BleiCorpus(fname, fname_vocab=None)
Bases: gensim.corpora.indexedcorpus.IndexedCorpus
Corpus in Blei’s LDA-C format.
The corpus is represented as two files: one describing the documents, and another describing the mapping between words and their ids.
Each document is one line:
N fieldId1:fieldValue1 fieldId2:fieldValue2 ... fieldIdN:fieldValueN
The vocabulary is a file with words, one word per line; the word at line K has an implicit id=K.
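The document and vocabulary layouts described above can be illustrated with a minimal, standard-library sketch. The helper names `parse_ldac_line` and `load_vocab` are hypothetical and are not part of gensim; the sketch also assumes that ids start at 0 and that field values are read as floats.

```python
def parse_ldac_line(line):
    """Parse one LDA-C document line into a list of (field_id, value) pairs."""
    parts = line.split()
    if not parts:
        return []
    declared = int(parts[0])  # leading N: number of fields that follow
    pairs = []
    for field in parts[1:]:
        field_id, value = field.rsplit(":", 1)
        pairs.append((int(field_id), float(value)))
    if len(pairs) != declared:
        raise ValueError("expected %d fields, found %d" % (declared, len(pairs)))
    return pairs


def load_vocab(lines):
    """Give each word the id of its line (assumption: numbering starts at 0)."""
    return {line_no: word.strip() for line_no, word in enumerate(lines)}


doc = parse_ldac_line("3 0:2 4:1 7:5")
vocab = load_vocab(["apple\n", "banana\n", "cherry\n"])
print(doc)
print(vocab[1])
```

The count check mirrors the role of the leading N: a line whose declared count does not match its fields is treated as malformed in this sketch.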
Initialize the corpus from a file.
fname_vocab is the file with vocabulary; if not specified, it defaults to fname.vocab.
docbyoffset(offset)
Return the document stored at file position offset.
line2doc(line)
load(fname, mmap=None)
Load a previously saved object from file (also see save).
If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don’t use mmap, load large arrays as normal objects.
If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. Load will raise an IOError if this condition is encountered.
save(*args, **kwargs)
save_corpus(fname, corpus, id2word=None, metadata=False)
Save a corpus in the LDA-C format.
There are actually two files saved: fname and fname.vocab, where fname.vocab is the vocabulary file.
This function is automatically called by BleiCorpus.serialize; don’t call it directly, call serialize instead.
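To make the two-file output concrete, here is a rough standard-library sketch of writing documents and a vocabulary in this layout. It is not gensim's implementation: `format_ldac_line` and `write_ldac` are hypothetical names, the streams stand in for fname and fname.vocab, and the `---` placeholder for ids missing from the mapping is an assumption of this sketch.

```python
import io


def format_ldac_line(doc):
    """Format a list of (field_id, value) pairs as one LDA-C document line."""
    doc = list(doc)
    fields = " ".join(f"{field_id}:{value:g}" for field_id, value in doc)
    return f"{len(doc)} {fields}".rstrip()


def write_ldac(doc_stream, vocab_stream, corpus, id2word):
    """Write documents to doc_stream and one word per line to vocab_stream."""
    for doc in corpus:
        doc_stream.write(format_ldac_line(doc) + "\n")
    num_terms = max(id2word) + 1 if id2word else 0
    for term_id in range(num_terms):
        # Line K must hold the word with id K, so gaps get a placeholder.
        vocab_stream.write(id2word.get(term_id, "---") + "\n")


corpus = [[(0, 2), (2, 1)], [(1, 3)]]
id2word = {0: "apple", 1: "banana", 2: "cherry"}
docs_out, vocab_out = io.StringIO(), io.StringIO()
write_ldac(docs_out, vocab_out, corpus, id2word)
print(docs_out.getvalue())
print(vocab_out.getvalue())
```

Writing the vocabulary in id order is what keeps the implicit line-number ids consistent with the field ids in the document file.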
serialize(serializer, fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)
Iterate through the document stream corpus, saving the documents to fname and recording the byte offset of each document. Save the resulting index structure to the file index_fname (or fname.index if index_fname is not set).
This relies on the underlying corpus class serializer providing (in addition to standard iteration):
a save_corpus method that returns a sequence of byte offsets, one for each saved document,
the docbyoffset(offset) method, which returns a document positioned at offset bytes within the persistent storage (file).
Example:
>>> from gensim.corpora import MmCorpus
>>> # `corpus` is any iterable of bag-of-words documents
>>> MmCorpus.serialize('test.mm', corpus)
>>> mm = MmCorpus('test.mm') # `mm` document stream now has random access
>>> print(mm[42]) # retrieve document no. 42, etc.
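The byte-offset idea behind serialize and docbyoffset can be sketched without gensim. The functions below are illustrative only (`serialize_with_offsets` and `doc_by_offset` are hypothetical names): one records the position of each document while writing, the other seeks to a recorded position and reads a single line.

```python
import os
import tempfile


def serialize_with_offsets(path, lines):
    """Write one document per line and return the byte offset of each one."""
    offsets = []
    with open(path, "wb") as fout:
        for line in lines:
            offsets.append(fout.tell())  # position where this document starts
            fout.write(line.encode("utf-8") + b"\n")
    return offsets


def doc_by_offset(path, offset):
    """Return the raw document line stored at byte position `offset`."""
    with open(path, "rb") as fin:
        fin.seek(offset)
        return fin.readline().decode("utf-8").rstrip("\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "corpus.lda-c")
    offsets = serialize_with_offsets(path, ["2 0:2 2:1", "1 1:3", "3 0:1 1:1 2:4"])
    second = doc_by_offset(path, offsets[1])  # random access to document no. 1

print(offsets)
print(second)
```

Opening the file in binary mode keeps the recorded positions true byte offsets, so a later seek lands exactly at the start of a document regardless of text encoding.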