corpora.mmcorpus – Corpus in Matrix Market format


Corpus in the Matrix Market format.

class gensim.corpora.mmcorpus.MmCorpus(fname)

Bases: gensim.matutils.MmReader, gensim.corpora.indexedcorpus.IndexedCorpus

Corpus in the Matrix Market format.

docbyoffset(offset): Return the document located at the given offset (in bytes) within the file.

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don’t use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), mmap cannot be used and mmap=None must be set; otherwise load raises an IOError.

save(*args, **kwargs)

static save_corpus(fname, corpus, id2word=None, progress_cnt=1000, metadata=False)

Save a corpus in the Matrix Market format to disk.

This function is automatically called by MmCorpus.serialize; don’t call it directly, call serialize instead.
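To make the on-disk layout concrete, here is a minimal standard-library sketch of writing a bag-of-words corpus as a Matrix Market coordinate matrix. It is not gensim's writer: the function name write_matrix_market and the choice of documents as rows and terms as columns are assumptions made for illustration. It shows only the general shape of the format: a banner line, a dimensions line, then one 1-based "row column value" entry per non-zero.

```python
import io


def write_matrix_market(corpus, out):
    """Write a bag-of-words corpus as a Matrix Market coordinate matrix.

    `corpus` is a list of documents, each a list of (term_id, weight) pairs
    with 0-based term ids. Rows are documents, columns are terms, and both
    are 1-based in the file. (Hypothetical helper, not part of gensim.)
    """
    num_docs = len(corpus)
    num_terms = 1 + max(
        (term_id for doc in corpus for term_id, _ in doc), default=-1
    )
    num_nnz = sum(len(doc) for doc in corpus)

    # Banner line, then "rows columns non-zeros".
    out.write("%%MatrixMarket matrix coordinate real general\n")
    out.write(f"{num_docs} {num_terms} {num_nnz}\n")

    # One line per non-zero entry; empty documents produce no lines.
    for doc_no, doc in enumerate(corpus, start=1):
        for term_id, weight in sorted(doc):
            out.write(f"{doc_no} {term_id + 1} {weight}\n")


corpus = [[(0, 1.0), (2, 2.0)], [], [(1, 0.5)]]
buffer = io.StringIO()
write_matrix_market(corpus, buffer)
text = buffer.getvalue()
print(text)
```

The sketch counts documents and entries up front because the dimensions line precedes the entries; a streaming writer would need another way to fill in that line once the totals are known.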

serialize(serializer, fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)

Iterate through the document stream corpus, saving the documents to fname and recording the byte offset of each document. Save the resulting index structure to the file index_fname (or fname.index if index_fname is not set).

This relies on the underlying corpus class serializer providing (in addition to standard iteration):

the save_corpus method, which returns a sequence of byte offsets, one for each saved document,
the docbyoffset(offset) method, which returns a document positioned at offset bytes within the persistent storage (file).

Example:

>>> from gensim.corpora import MmCorpus
>>> MmCorpus.serialize('test.mm', corpus)  # `corpus` is any iterable of bag-of-words documents
>>> mm = MmCorpus('test.mm')  # `mm` document stream now has random access
>>> print(mm[42])  # retrieve document no. 42, etc.
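The offset-recording contract described above can be sketched without gensim. The following is an illustration under stated assumptions, not the library's implementation: save_with_offsets and doc_by_offset are hypothetical stand-ins for save_corpus and docbyoffset, and the one-document-per-line "term_id:weight" storage format is invented for brevity (it is not Matrix Market).

```python
import io


def save_with_offsets(docs, out):
    """Write one document per line; return the byte offset where each starts.

    Hypothetical stand-in for a save_corpus-style method.
    """
    offsets = []
    for doc in docs:
        offsets.append(out.tell())  # position before this document is written
        line = " ".join(f"{term_id}:{weight}" for term_id, weight in doc)
        out.write(line.encode("utf-8") + b"\n")
    return offsets


def doc_by_offset(storage, offset):
    """Seek to `offset` bytes and parse the single document stored there.

    Hypothetical stand-in for a docbyoffset-style method.
    """
    storage.seek(offset)
    line = storage.readline().decode("utf-8").strip()
    doc = []
    for pair in line.split():
        term_id, weight = pair.split(":")
        doc.append((int(term_id), float(weight)))
    return doc


docs = [[(0, 1.0), (2, 2.0)], [], [(1, 0.5)]]
storage = io.BytesIO()
index = save_with_offsets(docs, storage)  # the "index structure"
third = doc_by_offset(storage, index[2])  # random access to document no. 2
print(index, third)
```

Keeping the offsets in a separate index is what makes random access cheap: retrieving document n costs one seek and one parse instead of a scan from the start of the file.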

skip_headers(input_file): Skip file headers that appear before the first document.
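As a rough illustration of header skipping, the sketch below assumes that "headers" means the leading lines of a Matrix Market file that start with '%' (the banner and any comments), followed by the dimensions line. The helper name skip_comment_headers is hypothetical, and the exact behaviour of the library method may differ.

```python
import io


def skip_comment_headers(input_file):
    """Advance past leading '%' lines; return the first non-comment line.

    Hypothetical helper: assumes headers are the '%'-prefixed lines at the
    top of a Matrix Market file opened in binary mode.
    """
    for line in input_file:
        if not line.startswith(b"%"):
            return line
    return b""  # file contained only comment lines (or was empty)


data = io.BytesIO(
    b"%%MatrixMarket matrix coordinate real general\n"
    b"% a free-form comment\n"
    b"3 3 3\n"
    b"1 1 1.0\n"
)
dims_line = skip_comment_headers(data)
num_docs, num_terms, num_nnz = map(int, dims_line.split())
first_entry = data.readline()  # the stream is now positioned at the entries
print(num_docs, num_terms, num_nnz, first_entry)
```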
