
corpora.textcorpus – Building corpora with dictionaries

Text corpora usually reside on disk, as text files in one format or another. In a common scenario, we need to build a dictionary (a word -> integer id mapping), which is then used to construct sparse bag-of-words vectors (= sequences of (word_id, word_weight) 2-tuples).

This module provides some code scaffolding to simplify this pipeline. For example, given a corpus where each document is a separate line in a file on disk, you would override the TextCorpus.get_texts method to read one line (= one document) at a time, process it (lowercase, tokenize, etc.) and yield it as a sequence of words.

Overriding get_texts is enough; you can then initialize the corpus with e.g. MyTextCorpus(bz2.BZ2File('mycorpus.txt.bz2')) and it will behave correctly, like a corpus of sparse vectors. The __iter__ method is set up automatically, and the dictionary is automatically populated with all word -> id mappings.
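For illustration, here is a minimal sketch of such a subclass (the file name, the whitespace tokenization and the naive __len__ are all illustrative choices, not prescribed by the class):

import bz2

from gensim import utils
from gensim.corpora.textcorpus import TextCorpus

class MyTextCorpus(TextCorpus):
    """Corpus where each line of the input file is one document."""

    def get_texts(self):
        # getstream() returns the underlying input as a file-like stream
        for line in self.getstream():
            # to_unicode handles the raw bytes coming from a compressed file
            yield utils.to_unicode(line).lower().split()

    def __len__(self):
        # naive document count; cache this if iterating is expensive
        return sum(1 for _ in self.get_texts())

corpus = MyTextCorpus(bz2.BZ2File('mycorpus.txt.bz2'))
for bow in corpus:
    print(bow)  # one sparse (word_id, word_count) vector per document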

The resulting object can be used as input to all gensim models (TF-IDF, LSI, ...) and serialized in any of the supported formats (Matrix Market, SVMlight, Blei's LDA-C format etc.).
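For example, continuing the sketch above (corpus is the MyTextCorpus instance; the output file name is arbitrary):

>>> from gensim import models
>>> from gensim.corpora import MmCorpus
>>> tfidf = models.TfidfModel(corpus)  # train a TF-IDF model from the bag-of-words vectors
>>> MmCorpus.serialize('mycorpus_tfidf.mm', tfidf[corpus])  # store the transformed corpus in Matrix Market format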

See the gensim.test.test_miislita.CorpusMiislita class for a simple example.

class gensim.corpora.textcorpus.TextCorpus(input=None)

Bases: gensim.interfaces.CorpusABC

Helper class to simplify the pipeline of getting bag-of-words vectors (= a gensim corpus) from plain text.

This is an abstract base class: override the get_texts() and __len__() methods to match your particular input.

Given a filename (or a file-like object) in the constructor, the corpus object will be automatically initialized with a dictionary in self.dictionary and will support iteration over the corpus (the __iter__ method). You only need to provide a correct get_texts implementation.

get_texts()

Iterate over the collection, yielding one document at a time. A document is a sequence of words (strings) that can be fed into Dictionary.doc2bow.

Override this function to match your input (parse input files, do any text preprocessing such as lowercasing or tokenizing, etc.). There will be no further preprocessing of the words coming out of this function.
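As a sketch of richer preprocessing inside get_texts, one option is gensim.utils.tokenize, which converts the input to unicode, keeps only alphabetic tokens and can also lowercase and strip accents (the class name here is hypothetical):

from gensim import utils
from gensim.corpora.textcorpus import TextCorpus

class TokenizedLineCorpus(TextCorpus):
    def get_texts(self):
        for line in self.getstream():
            # tokenize() drops punctuation/digits; deacc=True removes accents
            yield list(utils.tokenize(line, lowercase=True, deacc=True))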

getstream()

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. Load will raise an IOError if this condition is encountered.
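For example, a save/load round trip might look like this (the file name is arbitrary; mmap='r' only has an effect if large arrays were stored separately):

>>> corpus.save('my_textcorpus')  # persist the corpus object itself
>>> corpus = TextCorpus.load('my_textcorpus', mmap='r')  # memory-map any separately stored arrays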

save(*args, **kwargs)

save_corpus(fname, corpus, id2word=None, metadata=False)

Save an existing corpus to disk.

Some formats also support saving the dictionary (feature_id->word mapping), which can in this case be provided by the optional id2word parameter.

>>> MmCorpus.save_corpus('file.mm', corpus)

Some corpora also support an index of where each document begins, so that the documents on disk can be accessed in O(1) time (see the corpora.IndexedCorpus base class). In this case, save_corpus is automatically called internally by serialize, which saves the index at the same time, so you should store the corpus with:

>>> MmCorpus.serialize('file.mm', corpus) # stores index as well, allowing random access to individual documents

Calling serialize() is preferred to calling save_corpus().
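The stored index then enables random access when the serialized corpus is loaded back (document number 42 here is just an example):

>>> mm = MmCorpus('file.mm')  # the accompanying index file is picked up automatically
>>> print(mm[42])  # retrieve document no. 42, in O(1)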