corpora.ucicorpus – Corpus in UCI bag-of-words format
University of California, Irvine (UCI) Bag-of-Words format.
http://archive.ics.uci.edu/ml/datasets/Bag+of+Words
gensim.corpora.ucicorpus.UciCorpus(fname, fname_vocab=None)
Bases: gensim.corpora.ucicorpus.UciReader, gensim.corpora.indexedcorpus.IndexedCorpus
Corpus in the UCI bag-of-words format.
create_dictionary()
Utility method to generate a gensim-style Dictionary directly from the corpus and vocabulary data.
docbyoffset(offset)
Return the document at file offset offset (in bytes).
load(fname, mmap=None)
Load a previously saved object from file (also see save).
If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.
If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. Load will raise an IOError if this condition is encountered.
save(*args, **kwargs)
save_corpus(fname, corpus, id2word=None, progress_cnt=10000, metadata=False)
Save a corpus in the UCI Bag-of-Words format.
There are actually two files saved: fname and fname.vocab, where fname.vocab is the vocabulary file.
This function is automatically called by UciCorpus.serialize; don’t call it directly, call serialize instead.
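The split into a corpus file and a vocabulary file can be pictured with a small standard-library sketch. It does not call gensim. It assumes the vocabulary file holds one word per line, ordered by word id, with a placeholder for ids that have no word; the helper names `write_vocab` and `read_vocab` and the `"---"` placeholder are illustrative assumptions, not part of the documented API.

```python
import io


def write_vocab(id2word, num_terms, out, placeholder="---"):
    """Write one word per line, ordered by 0-based word id (assumed layout)."""
    for term_id in range(num_terms):
        # Ids without a known word get a placeholder so line numbers stay aligned.
        out.write(id2word.get(term_id, placeholder) + "\n")


def read_vocab(inp):
    """Rebuild the id -> word mapping from the line order."""
    return {term_id: line.rstrip("\n") for term_id, line in enumerate(inp)}


vocab_file = io.StringIO()
write_vocab({0: "human", 2: "computer"}, 3, vocab_file)
vocab_file.seek(0)
vocab = read_vocab(vocab_file)
```

Because the word id is implied by the line number, the vocabulary file needs no explicit id column.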
serialize(serializer, fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)
Iterate through the document stream corpus, saving the documents to fname and recording the byte offset of each document. Save the resulting index structure to the file index_fname (or fname.index if index_fname is not set).
This relies on the underlying corpus class serializer providing (in addition to standard iteration):
the save_corpus method, which returns a sequence of byte offsets, one for each saved document,
the docbyoffset(offset) method, which returns a document positioned at offset bytes within the persistent storage (file).
Example:
>>> MmCorpus.serialize('test.mm', corpus)
>>> mm = MmCorpus('test.mm') # `mm` document stream now has random access
>>> print(mm[42]) # retrieve document no. 42, etc.
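The offset bookkeeping that serialize and docbyoffset rely on can be illustrated with a minimal standard-library sketch. It does not use gensim: the one-document-per-line layout and the names `save_with_offsets` and `doc_by_offset` are hypothetical, chosen only to show how recorded byte offsets give random access to a sequential file.

```python
import io


def save_with_offsets(docs, out):
    """Write each document on its own line; return the byte offset where each starts."""
    offsets = []
    for doc in docs:
        offsets.append(out.tell())  # position of this document in the stream
        line = " ".join(f"{term_id}:{count}" for term_id, count in doc)
        out.write(line.encode("utf8") + b"\n")
    return offsets


def doc_by_offset(inp, offset):
    """Seek straight to a recorded offset and parse the document stored there."""
    inp.seek(offset)
    items = inp.readline().decode("utf8").split()
    return [(int(t), int(c)) for t, c in (item.split(":") for item in items)]


docs = [[(0, 1), (3, 2)], [(1, 5)], []]
storage = io.BytesIO()
index = save_with_offsets(docs, storage)

# Random access: fetch document no. 1 without reading document no. 0.
second = doc_by_offset(storage, index[1])
```

Storing the list of offsets separately (the role of the index file) is what lets `corpus[i]` jump directly to document `i`.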
skip_headers(input_file)
gensim.corpora.ucicorpus.UciReader(input)
Bases: gensim.matutils.MmReader
Initialize the reader.
The input parameter refers to a file on the local filesystem, which is expected to be in the UCI Bag-of-Words format.
docbyoffset(offset)
Return the document at file offset offset (in bytes).
skip_headers(input_file)
gensim.corpora.ucicorpus.UciWriter(fname)
Bases: gensim.matutils.MmWriter
Store a corpus in UCI Bag-of-Words format.
This corpus format is identical to the MM format, except for the file headers: there is no format line, and the first three lines of the file contain num_docs, num_terms, and num_nnz, one value per line.
This implementation is based on matutils.MmWriter, and works the same way.
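As a rough illustration of the layout described above, the following standard-library sketch writes and parses a tiny corpus. It is not gensim's implementation. It assumes the body lines are `docID wordID count` triplets with 1-based ids, as in the UCI datasets; `write_uci` and `read_uci` are hypothetical helper names.

```python
import io


def write_uci(docs, num_terms, out):
    """Write sparse documents (lists of (term_id, count), 0-based ids) as UCI text."""
    num_nnz = sum(len(doc) for doc in docs)
    # Three header lines: number of documents, vocabulary size, non-zero entries.
    out.write(f"{len(docs)}\n{num_terms}\n{num_nnz}\n")
    for docno, doc in enumerate(docs):
        for term_id, count in sorted(doc):
            # Body lines use 1-based document and word ids.
            out.write(f"{docno + 1} {term_id + 1} {count}\n")


def read_uci(inp):
    """Parse the three header lines, then collect the triplets per document."""
    num_docs, num_terms, num_nnz = (int(inp.readline()) for _ in range(3))
    docs = [[] for _ in range(num_docs)]
    for line in inp:
        if not line.strip():
            continue
        doc_id, term_id, count = line.split()
        docs[int(doc_id) - 1].append((int(term_id) - 1, int(count)))
    return docs, num_terms, num_nnz


corpus = [[(0, 2), (2, 1)], [(1, 3)]]
uci_file = io.StringIO()
write_uci(corpus, 3, uci_file)
uci_file.seek(0)
parsed = read_uci(uci_file)
```

Compared with a Matrix Market file, the only structural difference in this sketch is the header: no `%%MatrixMarket` format line, and the three counts sit on separate lines instead of one.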
FAKE_HEADER = ' \n'
HEADER_LINE = '%%MatrixMarket matrix coordinate real general\n'
MAX_HEADER_LENGTH = 20
close()
fake_headers(num_docs, num_terms, num_nnz)
update_headers(num_docs, num_terms, num_nnz)
Update headers with actual values.
write_corpus(fname, corpus, progress_cnt=1000, index=False)
write_headers()
Write blank header lines. Will be updated later, once corpus stats are known.
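The placeholder-header idea can be sketched with the standard library alone: reserve fixed-width blank lines first, stream the documents, then seek back and overwrite the blanks once the totals are known. This is a simplified illustration under stated assumptions, not the UciWriter code; `HEADER_WIDTH`, `write_blank_headers` and `fill_headers` are hypothetical names.

```python
import io

HEADER_WIDTH = 20  # assumed number of characters reserved per header value


def write_blank_headers(out):
    """Reserve three fixed-width lines for num_docs, num_terms and num_nnz."""
    for _ in range(3):
        out.write(b" " * HEADER_WIDTH + b"\n")


def fill_headers(out, num_docs, num_terms, num_nnz):
    """Overwrite the reserved lines in place; trailing spaces pad each value."""
    for lineno, value in enumerate((num_docs, num_terms, num_nnz)):
        text = str(value).encode("utf8")
        if len(text) > HEADER_WIDTH:
            raise ValueError("header value does not fit in the reserved space")
        out.seek(lineno * (HEADER_WIDTH + 1))  # +1 accounts for the newline
        out.write(text)


stream = io.BytesIO()
write_blank_headers(stream)
stream.write(b"1 1 2\n")       # document data is streamed after the blanks
fill_headers(stream, 1, 1, 1)  # totals are only known at the end
header_lines = stream.getvalue().split(b"\n")[:3]
```

Reserving the space up front means the corpus can be written in a single pass, without knowing its size in advance and without rewriting the body.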
write_vector(docno, vector)
Write a single sparse vector to the file.
A sparse vector is any iterable yielding (field id, field value) pairs.