utils – Various utility functions

This module contains various general utility functions.

class gensim.utils.ClippedCorpus(corpus, max_docs=None)

Bases: gensim.utils.SaveLoad

Return a corpus that is the “head” of input iterable corpus.

Any documents after max_docs are ignored. This effectively limits the length of the returned corpus to <= max_docs. Set max_docs=None for “no limit”, effectively wrapping the entire input corpus.
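A minimal usage sketch (toy corpus for illustration):

>>> from gensim.utils import ClippedCorpus
>>> corpus = [[(0, 1.0)], [(1, 2.0)], [(2, 1.0)], [(3, 1.0)]]  # 4 documents
>>> list(ClippedCorpus(corpus, max_docs=2))  # keep only the first two
[[(0, 1.0)], [(1, 2.0)]]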

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.
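An illustrative save/load round trip, using the ClippedCorpus defined above (the file path is arbitrary; with no large arrays in the object, nothing is stored separately):

>>> clipped = ClippedCorpus([[(0, 1.0)], [(1, 2.0)]], max_docs=2)
>>> clipped.save('/tmp/clipped.pkl')
>>> restored = ClippedCorpus.load('/tmp/clipped.pkl')  # mmap='r' would memory-map separately stored arrays
>>> list(restored)
[[(0, 1.0)], [(1, 2.0)]]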

class gensim.utils.FakeDict(num_terms)

Bases: object

Objects of this class act as dictionaries that map integer->str(integer), for a specified range of integers [0, num_terms).

This is meant to avoid allocating real dictionaries when num_terms is huge, which is a waste of memory.

get(val, default=None)
iteritems()
keys()

Override the dict.keys() function, which is used to determine the maximum internal id of a corpus (i.e., the vocabulary dimensionality).

HACK: To avoid materializing the whole range(0, self.num_terms), this returns only the highest id, as the single-element list [self.num_terms - 1].
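A sketch of the behaviour described above:

>>> from gensim.utils import FakeDict
>>> d = FakeDict(10 ** 6)  # pretends to map 0 -> '0', ..., 999999 -> '999999'
>>> d[42], d.get(42)
('42', '42')
>>> d.keys()  # only the highest id, per the HACK above
[999999]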

class gensim.utils.InputQueue(q, corpus, chunksize, maxsize, as_numpy)

Bases: multiprocessing.process.Process

authkey
daemon

Return whether process is a daemon

exitcode

Return exit code of process or None if it has yet to stop

ident

Return identifier (PID) of process or None if it has yet to start

is_alive()

Return whether process is alive

join(timeout=None)

Wait until child process terminates

name
pid

Return identifier (PID) of process or None if it has yet to start

run()
start()

Start child process

terminate()

Terminate process; sends SIGTERM signal or uses TerminateProcess()

class gensim.utils.NoCM

Bases: object

acquire()
release()
class gensim.utils.RepeatCorpus(corpus, reps)

Bases: gensim.utils.SaveLoad

Used in the tutorial on distributed computing and likely not useful anywhere else.

Wrap a corpus as another corpus of length reps. This is achieved by repeating documents from corpus over and over again, until the requested length len(result) == reps is reached. Repetition is done on the fly, efficiently, via itertools.

>>> corpus = [[(1, 0.5)], []] # 2 documents
>>> list(RepeatCorpus(corpus, 5)) # repeat 2.5 times to get 5 documents
[[(1, 0.5)], [], [(1, 0.5)], [], [(1, 0.5)]]
load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.

class gensim.utils.RepeatCorpusNTimes(corpus, n)

Bases: gensim.utils.SaveLoad

Repeat a corpus n times.

>>> corpus = [[(1, 0.5)], []]
>>> list(RepeatCorpusNTimes(corpus, 3)) # repeat 3 times
[[(1, 0.5)], [], [(1, 0.5)], [], [(1, 0.5)], []]
load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.

class gensim.utils.SaveLoad

Bases: object

Objects which inherit from this class have save/load functions, which un/pickle them to disk.

This uses pickle for de/serializing, so objects must not contain unpicklable attributes, such as lambda functions etc.

classmethod load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.

class gensim.utils.SlicedCorpus(corpus, slice_)

Bases: gensim.utils.SaveLoad

Return a corpus that is the slice of input iterable corpus.

Negative slicing can only be used if the corpus is indexable. Otherwise, the corpus will be iterated over.

Slice can also be a numpy.ndarray to support fancy indexing.

NOTE: calculating the size of a SlicedCorpus is expensive when a plain slice is used, because the corpus has to be iterated over once. Using a list or numpy.ndarray instead does not have this drawback, but consumes more memory.
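A minimal sketch with a non-indexable (iterator) corpus, which gets sliced by plain iteration:

>>> from gensim.utils import SlicedCorpus
>>> stream = iter([[(0, 1.0)], [(1, 2.0)], [(2, 1.0)], [(3, 1.0)]])
>>> list(SlicedCorpus(stream, slice(1, 3)))
[[(1, 2.0)], [(2, 1.0)]]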

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.

gensim.utils.any2unicode(text, encoding='utf8', errors='strict')

Convert a string (bytestring in encoding or unicode), to unicode.

gensim.utils.any2utf8(text, errors='strict', encoding='utf8')

Convert a string (unicode or bytestring in encoding), to bytestring in utf8.
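A round-trip sketch of the two converters (outputs shown in Python 2 repr, matching the other examples on this page):

>>> from gensim.utils import any2unicode, any2utf8
>>> any2unicode(b'Mot\xc3\xb6rhead')  # utf8 bytestring -> unicode
u'Mot\xf6rhead'
>>> any2utf8(u'Mot\xf6rhead')  # unicode -> utf8 bytestring
'Mot\xc3\xb6rhead'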

gensim.utils.check_output(*popenargs, **kwargs)

Run command with arguments and return its output as a byte string. Backported from Python 2.7, as it's implemented as pure Python in the stdlib. Adds extra KeyboardInterrupt handling.

>>> check_output(['/usr/bin/python', '--version'])
Python 2.6.2

gensim.utils.chunkize(corpus, chunksize, maxsize=0, as_numpy=False)

Split a stream of values into smaller chunks. Each chunk is of length chunksize, except the last one which may be smaller. A once-only input stream (corpus from a generator) is ok, chunking is done efficiently via itertools.

If maxsize > 1, don’t wait idly in between successive chunk yields, but rather keep filling a short queue (of size at most maxsize) with forthcoming chunks in advance. This is realized by starting a separate process, and is meant to reduce I/O delays, which can be significant when corpus comes from a slow medium (like harddisk).

If maxsize==0, don't fool around with parallelism and simply yield chunks serially via chunkize_serial() (no I/O optimizations).

>>> for chunk in chunkize(range(10), 4): print(chunk)
[0, 1, 2, 3]
[4, 5, 6, 7]
[8, 9]
gensim.utils.chunkize_serial(iterable, chunksize, as_numpy=False)

Return elements from the iterable in chunksize-ed lists. The last returned element may be smaller (if length of collection is not divisible by chunksize).

>>> print(list(grouper(range(10), 3)))
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]

gensim.utils.copytree_hardlink(source, dest)

Recursively copy a directory à la shutil.copytree, but hardlink files instead of copying. Available on UNIX systems only.

gensim.utils.deaccent(text)

Remove accentuation from the given string. Input text is either a unicode string or utf8 encoded bytestring.

Return input string with accents removed, as unicode.

>>> deaccent("Šéf chomutovských komunistů dostal poštou bílý prášek")
u'Sef chomutovskych komunistu dostal postou bily prasek'
gensim.utils.decode_htmlentities(text)

Decode HTML entities in text, coded as hex, decimal or named.

Adapted from http://github.com/sku/python-twitter-ircbot/blob/321d94e0e40d0acc92f5bf57d126b57369da70de/html_decode.py

>>> u = u'E tu vivrai nel terrore - L&#x27;aldil&#xE0; (1981)'
>>> print(decode_htmlentities(u).encode('UTF-8'))
E tu vivrai nel terrore - L'aldilà (1981)
>>> print(decode_htmlentities("l&#39;eau"))
l'eau
>>> print(decode_htmlentities("foo &lt; bar"))
foo < bar
gensim.utils.dict_from_corpus(corpus)

Scan corpus for all word ids that appear in it, then construct and return a mapping which maps each wordId -> str(wordId).

This function is used whenever words need to be displayed (as opposed to just their ids) but no wordId->word mapping was provided. The resulting mapping only covers words actually used in the corpus, up to the highest wordId found.
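For example (toy corpus; the result behaves like a mapping covering ids 0 through the highest id found):

>>> from gensim.utils import dict_from_corpus
>>> corpus = [[(0, 1.0), (5, 2.0)], [(3, 1.0)]]
>>> id2word = dict_from_corpus(corpus)
>>> id2word[5]
'5'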

gensim.utils.file_or_filename(*args, **kwds)

Return a file-like object ready to be read from the beginning. The input is either a filename (gz/bz2 also supported) or a file-like object supporting seek.

gensim.utils.getNS()

Return a Pyro name server proxy. If there is no name server running, start one on 0.0.0.0 (all interfaces), as a background process.

gensim.utils.get_max_id(corpus)

Return the highest feature id that appears in the corpus.

For empty corpora (no features at all), return -1.

gensim.utils.get_my_ip()

Try to obtain our external IP (from the Pyro name server's point of view).

This tries to sidestep the issue of bogus /etc/hosts entries and other local misconfigurations, which often mess up hostname resolution.

If all else fails, fall back to simple socket.gethostbyname() lookup.

gensim.utils.grouper(iterable, chunksize, as_numpy=False)

Return elements from the iterable in chunksize-ed lists. The last returned element may be smaller (if length of collection is not divisible by chunksize).

>>> print(list(grouper(range(10), 3)))
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
gensim.utils.has_pattern()

Check whether the optional pattern library is installed.

gensim.utils.identity(p)

Identity function, for flows that don't accept lambdas (pickling etc.).

gensim.utils.is_corpus(obj)

Check whether obj is a corpus. Return (is_corpus, new) 2-tuple, where new is obj if obj was an iterable, or new yields the same sequence as obj if it was an iterator.

obj is a corpus if it supports iteration over documents, where a document is in turn anything that acts as a sequence of 2-tuples (int, float).

Note: An “empty” corpus (empty input sequence) is ambiguous, so in this case the result is forcefully defined as is_corpus=False.
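A sketch of both cases:

>>> from gensim.utils import is_corpus
>>> docs = [[(0, 1.0)], [(1, 2.0)]]
>>> flag, new = is_corpus(docs)  # plain iterable: returned as-is
>>> flag
True
>>> flag, new = is_corpus(iter(docs))  # iterator: the first document is peeked at
>>> list(new)  # ...but `new` replays the full sequence
[[(0, 1.0)], [(1, 2.0)]]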

gensim.utils.keep_vocab_item(word, count, min_count, trim_rule=None)
gensim.utils.lemmatize(content, allowed_tags=<_sre.SRE_Pattern object>, light=False, stopwords=frozenset([]), min_length=2, max_length=15)

This function is only available when the optional ‘pattern’ package is installed.

Use the English lemmatizer from pattern to extract UTF8-encoded tokens in their base form (lemma), e.g. “are, is, being” -> “be” etc. This is a smarter version of stemming, taking word context into account.

Only considers nouns, verbs, adjectives and adverbs by default (all other lemmas are discarded).

>>> lemmatize('Hello World! How is it going?! Nonexistentword, 21')
['world/NN', 'be/VB', 'go/VB', 'nonexistentword/NN']
>>> lemmatize('The study ranks high.')
['study/NN', 'rank/VB', 'high/JJ']
>>> lemmatize('The ranks study hard.')
['rank/NN', 'study/VB', 'hard/RB']
gensim.utils.mock_data(n_items=1000, dim=1000, prob_nnz=0.5, lam=1.0)

Create a random gensim-style corpus, as a list of lists of (int, float) tuples, to be used as a mock corpus.

gensim.utils.mock_data_row(dim=1000, prob_nnz=0.5, lam=1.0)

Create a random gensim sparse vector. Each coordinate is nonzero with probability prob_nnz, each non-zero coordinate value is drawn from a Poisson distribution with parameter lambda equal to lam.

gensim.utils.pickle(obj, fname, protocol=2)

Pickle object obj to file fname.

protocol defaults to 2 so pickled objects are compatible across Python 2.x and 3.x.

gensim.utils.prune_vocab(vocab, min_reduce, trim_rule=None)

Remove all entries from the vocab dictionary with count smaller than min_reduce.

Modifies vocab in place, returns the sum of all counts that were pruned.
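For example:

>>> from gensim.utils import prune_vocab
>>> vocab = {'the': 100, 'rare': 2, 'word': 50}
>>> prune_vocab(vocab, min_reduce=10)  # sum of the pruned counts
2
>>> sorted(vocab)  # 'rare' was removed in place
['the', 'word']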

gensim.utils.pyro_daemon(name, obj, random_suffix=False, ip=None, port=None)

Register object with name server (starting the name server if not running yet) and block until the daemon is terminated. The object is registered under name, or name plus some random suffix if random_suffix is set.

gensim.utils.qsize(queue)

Return the (approximate) queue size where available; -1 where not (OS X).

gensim.utils.randfname(prefix='gensim')
gensim.utils.revdict(d)

Reverse a dictionary mapping.

When two keys map to the same value, only one of them will be kept in the result (which one is kept is arbitrary).
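For example:

>>> from gensim.utils import revdict
>>> revdict({'a': 1, 'b': 2}) == {1: 'a', 2: 'b'}
True
>>> len(revdict({'a': 1, 'b': 1}))  # colliding values: only one key survives
1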

gensim.utils.safe_unichr(intval)
gensim.utils.simple_preprocess(doc, deacc=False, min_len=2, max_len=15)

Convert a document into a list of tokens.

This lowercases, tokenizes and optionally de-accents the document. The output is a list of final tokens (unicode strings) that won't be processed any further.
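For example (Python 2 repr, as elsewhere on this page):

>>> from gensim.utils import simple_preprocess
>>> simple_preprocess(u'Hello, World! The caf\xe9 is open 24/7.', deacc=True)
[u'hello', u'world', u'the', u'cafe', u'is', u'open']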

gensim.utils.smart_extension(fname, ext)
gensim.utils.synchronous(tlockname)

A decorator to place an instance-based lock around a method.

Adapted from http://code.activestate.com/recipes/577105-synchronization-decorator-for-class-methods/
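A usage sketch, assuming (as in the linked recipe) that the decorator looks up the lock attribute named by tlockname on the instance:

>>> import threading
>>> from gensim.utils import synchronous
>>> class Counter(object):
...     def __init__(self):
...         self.lock = threading.RLock()  # the attribute named by tlockname
...         self.value = 0
...     @synchronous('lock')  # each call runs under self.lock
...     def increment(self):
...         self.value += 1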

gensim.utils.to_unicode(text, encoding='utf8', errors='strict')

Convert a string (bytestring in encoding or unicode), to unicode.

gensim.utils.to_utf8(text, errors='strict', encoding='utf8')

Convert a string (unicode or bytestring in encoding), to bytestring in utf8.

gensim.utils.tokenize(text, lowercase=False, deacc=False, errors='strict', to_lower=False, lower=False)

Iteratively yield tokens as unicode strings, removing accent marks (if deacc=True) and optionally lowercasing the unicode string by setting any one of the parameters lowercase, to_lower, or lower to True.

Input text may be either unicode or utf8-encoded byte string.

The tokens on output are maximal contiguous sequences of alphabetic characters (no digits!).

>>> list(tokenize('Nic nemůže letět rychlostí vyšší, než 300 tisíc kilometrů za sekundu!', deacc = True))
[u'Nic', u'nemuze', u'letet', u'rychlosti', u'vyssi', u'nez', u'tisic', u'kilometru', u'za', u'sekundu']
gensim.utils.toptexts(query, texts, index, n=10)

Debug function to help inspect the top n most similar documents (according to a similarity index index), to see if they are actually related to the query.

texts is any object that can return something insightful for each document via texts[docid], such as its fulltext or snippet.

Return a list of 3-tuples (docid, doc’s similarity to the query, texts[docid]).

gensim.utils.unpickle(fname)

Load a pickled object from fname.

gensim.utils.upload_chunked(server, docs, chunksize=1000, preprocess=None)

Memory-friendly upload of documents to a SimServer (or Pyro SimServer proxy).

Use this function to train or index large collections – avoid sending the entire corpus over the wire as a single Pyro in-memory object. The documents will be sent in smaller chunks, of chunksize documents each.