corpora.svmlightcorpus – Corpus in SVMlight format
gensim.corpora.svmlightcorpus.SvmLightCorpus(fname, store_labels=True)
Bases: gensim.corpora.indexedcorpus.IndexedCorpus
Corpus in SVMlight format.
Quoting http://svmlight.joachims.org/: The input file contains the training examples. The first lines may contain comments and are ignored if they start with #. Each of the following lines represents one training example and is of the following format:
<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
<target> .=. +1 | -1 | 0 | <float>
<feature> .=. <integer> | "qid"
<value> .=. <float>
<info> .=. <string>
The “qid” feature (used for SVMlight ranking), if present, is ignored.
Although not mentioned in the specification above, SVMlight also expects its feature ids to be 1-based (counting starts at 1). We convert features to 0-based ids internally by decrementing all ids when loading an SVMlight input file, and increment them again when saving as SVMlight.
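The parsing rules above (skip comments, ignore “qid”, shift feature ids from 1-based to 0-based) can be sketched in plain Python. This is only an illustration under stated assumptions: parse_svmlight_line is a hypothetical helper, not gensim's own code, and it assumes whitespace-separated tokens with no "#" inside feature values.

```python
def parse_svmlight_line(line):
    """Parse one SVMlight line into (doc, target); return None for comment or blank lines.

    Hypothetical helper for illustration only, not part of gensim.
    """
    # Drop the trailing "# <info>" part; a whole-line comment becomes an empty string.
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    parts = line.split()
    target = float(parts[0])  # <target> is +1, -1, 0 or any float
    doc = []
    for token in parts[1:]:
        feature, value = token.rsplit(":", 1)
        if feature == "qid":
            continue  # ranking feature, ignored
        # SVMlight ids are 1-based; shift them to 0-based.
        doc.append((int(feature) - 1, float(value)))
    return doc, target


print(parse_svmlight_line("+1 1:0.5 qid:3 4:2.0 # first example"))
# → ([(0, 0.5), (3, 2.0)], 1.0)
```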
Initialize the corpus from a file.
Although vector labels (~SVM target class) are not used in gensim in any way, they are parsed and stored in self.labels for convenience. Set store_labels=False to skip storing these labels (e.g. if there are too many vectors to store the self.labels array in memory).
doc2line(doc, label=0)
Output the document in SVMlight format, as a string. Inverse function to line2doc.
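As a rough sketch of what such a conversion involves, the function below turns a list of 0-based (feature id, value) pairs into one SVMlight line, incrementing the ids back to 1-based. The name format_svmlight_line is hypothetical and the exact number formatting is an assumption; gensim's doc2line may differ in detail.

```python
def format_svmlight_line(doc, label=0):
    """Render a document as one SVMlight line, ending in a newline.

    Hypothetical helper for illustration only, not part of gensim.
    `doc` is a list of (0-based feature id, value) pairs.
    """
    # Feature ids are shifted back to 1-based, as SVMlight expects.
    pairs = " ".join("%i:%s" % (term_id + 1, value) for term_id, value in doc)
    return "%s %s\n" % (label, pairs)


print(repr(format_svmlight_line([(0, 0.5), (3, 2.0)], label=1)))
# → '1 1:0.5 4:2.0\n'
```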
docbyoffset(offset)
Return the document stored at file position offset.
line2doc(line)
Create a document from a single line (string) in SVMlight format.
load(fname, mmap=None)
Load a previously saved object from file (also see save).
If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.
If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set; load raises an IOError if mmap is requested for a compressed file.
save(*args, **kwargs)
save_corpus(fname, corpus, id2word=None, labels=False, metadata=False)
Save a corpus in the SVMlight format.
The SVMlight <target> class tag is taken from the labels array, or set to 0 for all documents if labels is not supplied.
This function is automatically called by SvmLightCorpus.serialize; don't call it directly, call serialize instead.
serialize(serializer, fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)
Iterate through the document stream corpus, saving the documents to fname and recording the byte offset of each document. Save the resulting index structure to the file index_fname (or fname.index if index_fname is not set).
This relies on the underlying corpus class serializer providing (in addition to standard iteration):
the save_corpus method, which returns a sequence of byte offsets, one for each saved document,
the docbyoffset(offset) method, which returns a document positioned at offset bytes within the persistent storage (file).
Example:
>>> from gensim.corpora import MmCorpus
>>> MmCorpus.serialize('test.mm', corpus) # `corpus` is any iterable of documents
>>> mm = MmCorpus('test.mm') # `mm` document stream now has random access
>>> print(mm[42]) # retrieve document no. 42, etc.
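The byte-offset idea behind serialize and docbyoffset can be sketched without gensim: record the file position before writing each document, then seek to a stored position to read that document back. The helper names write_with_offsets and read_at_offset and the file name toy.svmlight are hypothetical, and this is not how gensim stores its index file.

```python
import os
import tempfile


def write_with_offsets(path, lines):
    """Write each line to `path` and return the byte offset at which each line starts."""
    offsets = []
    with open(path, "wb") as fout:
        for line in lines:
            offsets.append(fout.tell())  # position of this document in the file
            fout.write(line.encode("utf8"))
    return offsets


def read_at_offset(path, offset):
    """Return the single line stored at byte position `offset`."""
    with open(path, "rb") as fin:
        fin.seek(offset)
        return fin.readline().decode("utf8")


lines = ["0 1:1.0 3:2.0\n", "0 2:0.5\n", "1 1:4.0 2:1.0 5:3.0\n"]
path = os.path.join(tempfile.mkdtemp(), "toy.svmlight")
offsets = write_with_offsets(path, lines)
print(offsets)
# → [0, 14, 22]
print(repr(read_at_offset(path, offsets[2])))
# → '1 1:4.0 2:1.0 5:3.0\n'
```

Keeping only the offsets in memory is what gives random access to a corpus that is otherwise read as a stream from disk.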