corpora.lowcorpus – Corpus in List-of-Words format
Corpus in GibbsLda++ format of List-Of-Words.
gensim.corpora.lowcorpus.LowCorpus(fname, id2word=None, line2words=<function split_on_space>)
Bases: gensim.corpora.indexedcorpus.IndexedCorpus
List_Of_Words corpus handles input in GibbsLda++ format.
Quoting http://gibbslda.sourceforge.net/#3.2_Input_Data_Format:
Both data for training/estimating the model and new data (i.e., previously
unseen data) have the same format as follows:
[M]
[document1]
[document2]
...
[documentM]
in which the first line is the total number for documents [M]. Each line
after that is one document. [documenti] is the ith document of the dataset
that consists of a list of Ni words/terms.
[documenti] = [wordi1] [wordi2] ... [wordiNi]
in which all [wordij] (i=1..M, j=1..Ni) are text strings and they are separated
by the blank character.
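The layout quoted above can be illustrated with a small standard-library sketch that writes a few tokenized documents and reads them back. The helper names `write_low` and `read_low` and the file name `sample.low` are hypothetical and are not part of gensim or GibbsLDA++.

```python
import os
import tempfile


def write_low(path, documents):
    """Write tokenized documents in the List-of-Words layout (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as fout:
        fout.write("%d\n" % len(documents))      # first line: [M], the document count
        for words in documents:
            fout.write(" ".join(words) + "\n")   # one document per line, blank-separated


def read_low(path):
    """Read the file back and check the declared document count (hypothetical helper)."""
    with open(path, encoding="utf-8") as fin:
        declared = int(fin.readline())
        documents = [line.split() for line in fin]
    if declared != len(documents):
        raise ValueError("header says %d documents, found %d" % (declared, len(documents)))
    return documents


docs = [["human", "interface", "computer"], ["survey", "user", "computer", "system"]]
path = os.path.join(tempfile.mkdtemp(), "sample.low")
write_low(path, docs)
round_trip = read_low(path)
```

The count check is a design choice of this sketch; it makes a truncated or hand-edited file fail loudly instead of silently yielding fewer documents.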
Initialize the corpus from a file.
id2word and line2words are optional parameters. If provided, id2word is a dictionary mapping between word_ids (integers) and words (strings). If not provided, the mapping is constructed from the documents.
line2words is a function which converts lines into tokens. Defaults to simple splitting on spaces.
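To make the roles of the two optional parameters concrete, here is a minimal sketch that does not use gensim. It assumes word ids are assigned in sorted vocabulary order, which is a choice made for this example only; `simple_split`, `build_id2word` and `line_to_bow` are hypothetical names.

```python
from collections import Counter


def simple_split(line):
    """Tokenizer in the spirit of the default: split a line on spaces."""
    return [word for word in line.strip().split(" ") if word]


def build_id2word(lines, line2words=simple_split):
    """Build an id -> word mapping from the documents (ids in sorted order: an assumption)."""
    vocab = sorted({word for line in lines for word in line2words(line)})
    return dict(enumerate(vocab))


def line_to_bow(line, id2word, line2words=simple_split):
    """Convert one line into a sorted list of (word_id, count) pairs."""
    word2id = {word: wid for wid, word in id2word.items()}
    counts = Counter(w for w in line2words(line) if w in word2id)
    return sorted((word2id[w], n) for w, n in counts.items())


lines = ["human interface computer", "computer system computer"]
id2word = build_id2word(lines)
bow = line_to_bow(lines[1], id2word)

# A custom line2words function, here lowercasing before splitting.
bow_lower = line_to_bow("Computer SYSTEM", id2word, line2words=lambda s: s.lower().split())
```

Passing a different `line2words` changes only how a line becomes tokens; the mapping from tokens to ids stays with `id2word`.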
docbyoffset(offset)
Return the document stored at file position offset.
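The idea of fetching a document by file position can be sketched with plain file operations: seek to a byte offset and read one line. This is only an illustration of the technique under the assumption of one document per line; `words_at_offset` is a hypothetical helper, not the gensim method.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sample.low")

# Write a two-document file in binary mode and note where each document starts.
offsets = []
with open(path, "wb") as fout:
    fout.write(b"2\n")                       # header line with the document count
    for line in (b"human interface computer\n", b"survey user system\n"):
        offsets.append(fout.tell())          # byte position of this document
        fout.write(line)


def words_at_offset(path, offset):
    """Return the tokens of the line starting at byte `offset` (hypothetical helper)."""
    with open(path, "rb") as fin:
        fin.seek(offset)
        return fin.readline().decode("utf-8").split()


second = words_at_offset(path, offsets[1])
```

Binary mode is used so that `tell()` and `seek()` work with plain byte positions.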
id2word
line2doc(line)
load(fname, mmap=None)
Load a previously saved object from file (also see save).
If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.
If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap must be left as None; load raises an IOError if mmap is set for a compressed file.
save(*args, **kwargs)
save_corpus(fname, corpus, id2word=None, metadata=False)
Save a corpus in the List-of-words format.
This function is automatically called by LowCorpus.serialize; don’t call it directly, call serialize instead.
serialize(serializer, fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)
Iterate through the document stream corpus, saving the documents to fname and recording the byte offset of each document. Save the resulting index structure to the file index_fname (or fname.index if index_fname is not set).
This relies on the underlying corpus class serializer providing (in addition to standard iteration):
a save_corpus method that returns a sequence of byte offsets, one for each saved document,
the docbyoffset(offset) method, which returns a document positioned at offset bytes within the persistent storage (file).
Example:
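>>> from gensim.corpora import MmCorpus
>>> # `corpus` is assumed to be an existing iterable of bag-of-words documents
>>> MmCorpus.serialize('test.mm', corpus)
>>> mm = MmCorpus('test.mm')  # `mm` document stream now has random access
>>> print(mm[42])  # retrieve document no. 42, etc.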
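The mechanism described for serialize (write documents, record a byte offset for each, store the offsets in a separate index file) can be sketched without gensim as follows. The names `serialize_lines` and `IndexedLines`, the use of pickle for the index, and the one-document-per-line layout are assumptions of this sketch and do not reflect how gensim stores its index.

```python
import os
import pickle
import tempfile


def serialize_lines(fname, documents, index_fname=None):
    """Write documents one per line and save their byte offsets (hypothetical helper)."""
    if index_fname is None:
        index_fname = fname + ".index"       # default index name, mirroring fname.index
    offsets = []
    with open(fname, "wb") as fout:
        fout.write(("%d\n" % len(documents)).encode("utf-8"))
        for words in documents:
            offsets.append(fout.tell())      # where this document starts
            fout.write((" ".join(words) + "\n").encode("utf-8"))
    with open(index_fname, "wb") as fidx:
        pickle.dump(offsets, fidx)           # index format is an assumption of this sketch
    return index_fname


class IndexedLines:
    """Random access to serialized documents through the saved offsets (hypothetical class)."""

    def __init__(self, fname, index_fname=None):
        self.fname = fname
        with open(index_fname or fname + ".index", "rb") as fidx:
            self.index = pickle.load(fidx)

    def __len__(self):
        return len(self.index)

    def __getitem__(self, docno):
        with open(self.fname, "rb") as fin:
            fin.seek(self.index[docno])      # jump straight to the requested document
            return fin.readline().decode("utf-8").split()


docs = [["human", "interface"], ["survey", "user", "system"], ["graph", "trees"]]
fname = os.path.join(tempfile.mkdtemp(), "corpus.low")
serialize_lines(fname, docs)
indexed = IndexedLines(fname)
```

With the index loaded, `indexed[1]` reads only the second document instead of scanning the file from the start.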
gensim.corpora.lowcorpus.split_on_space(s)