
similarities.docsim – Document similarity queries

This module contains functions and classes for computing similarities across a collection of documents in the Vector Space Model.

The main class is Similarity, which builds an index for a given set of documents. Once the index is built, you can perform efficient queries like “Tell me how similar is this query document to each document in the index?”. The result is a vector of numbers as large as the size of the initial set of documents, that is, one float for each index document. Alternatively, you can also request only the top-N most similar index documents to the query.

You can later add new documents to the index via Similarity.add_documents().

How It Works

The Similarity class splits the index into several smaller sub-indexes (“shards”), which are disk-based. If your entire index fits in memory (~hundreds of thousands of documents per 1 GB of RAM), you can also use the MatrixSimilarity or SparseMatrixSimilarity classes directly. These are simpler but do not scale as well (they keep the entire index in RAM, with no sharding).

Once the index has been initialized, you can query for document similarity simply by:

>>> index = Similarity('/tmp/tst', corpus, num_features=12) # build the index
>>> similarities = index[query] # get similarities between the query and all index documents

If you have more query documents, you can submit them all at once, in a batch:

>>> for similarities in index[batch_of_documents]: # the batch is simply an iterable of documents (=gensim corpus)
...     ...

The benefit of this batch (aka “chunked”) querying is much better performance. To see the speed-up on your machine, run python -m gensim.test.simspeed (compare to my results here).

There is also a special syntax for when you need similarity of documents in the index to the index itself (i.e. queries=indexed documents themselves). This special syntax uses the faster, batch queries internally and is ideal for all-vs-all pairwise similarities:

>>> for similarities in index: # yield similarities of the 1st indexed document, then 2nd...
...     ...
class gensim.similarities.docsim.MatrixSimilarity(corpus, num_best=None, dtype=numpy.float32, num_features=None, chunksize=256, corpus_len=None)

Compute similarity against a corpus of documents by storing the index matrix in memory. The similarity measure used is cosine between two vectors.

Use this if your input corpus contains dense vectors (such as documents in LSI space) and fits into RAM.

The matrix is internally stored as a dense numpy array. Unless the entire matrix fits into main memory, use Similarity instead.

See also Similarity and SparseMatrixSimilarity in this module.

num_features is the number of features in the corpus (will be determined automatically by scanning the corpus if not specified). See Similarity class for description of the other parameters.
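As a minimal sketch (lsi_corpus and query_vec are placeholder names for an LSI-transformed corpus and a query vector in the same space):

>>> from gensim.similarities.docsim import MatrixSimilarity
>>>
>>> # Keep the whole dense index in RAM; num_features = number of LSI topics.
>>> index = MatrixSimilarity(lsi_corpus, num_features=200)
>>> sims = index[query_vec]  # numpy array of cosine similarities, one per indexed document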

get_similarities(query)

Return similarity of sparse vector query to all documents in the corpus, as a numpy array.

If query is a collection of documents, return a 2D array of similarities of each document in query to all documents in the corpus (=batch query, faster than processing each document in turn).

Do not use this function directly; use the self[query] syntax instead.

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.
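For instance, a hedged sketch of the save/load round trip (the path is a placeholder; index is any MatrixSimilarity instance):

>>> index.save('/tmp/my_index')  # large arrays may be split off into separate files automatically
>>> index = MatrixSimilarity.load('/tmp/my_index', mmap='r')  # mmap the large arrays back read-only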

class gensim.similarities.docsim.Shard(fname, index)

A proxy class that represents a single shard instance within a Similarity index.

Basically just wraps a (Sparse)MatrixSimilarity instance, so that the index is mmap'ed from disk on request (at query time).

get_document_id(pos)

Return index vector at position pos.

The vector is of the same type as the underlying index (i.e., dense for MatrixSimilarity and scipy.sparse for SparseMatrixSimilarity).

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.

class gensim.similarities.docsim.Similarity(output_prefix, corpus, num_features, num_best=None, chunksize=256, shardsize=32768, norm='l2')

Compute cosine similarity of a dynamic query against a static corpus of documents (“the index”).

Scalability is achieved by sharding the index into smaller pieces, each of which fits into core memory (see the (Sparse)MatrixSimilarity classes in this module). The shards themselves are simply stored as files to disk and mmap’ed back as needed.

Construct the index from corpus. The index can be later extended by calling the add_documents method. Note: documents are split (internally, transparently) into shards of shardsize documents each, converted to a matrix, for faster BLAS calls. Each shard is stored to disk under output_prefix.shard_number (=you need write access to that location). If you don’t specify an output prefix, a random filename in temp will be used.

shardsize should be chosen so that a shardsize x chunksize matrix of floats fits comfortably into main memory.
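For example, with the defaults shardsize=32768 and chunksize=256, a shardsize × chunksize matrix of 4-byte floats takes 32768 × 256 × 4 bytes = 32 MB, which is comfortable on any modern machine; with num_features=1000, a full dense shard would occupy roughly 32768 × 1000 × 4 bytes ≈ 131 MB.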

num_features is the number of features in the corpus (e.g. size of the dictionary, or the number of latent topics for latent semantic models).

norm is the normalization to apply to vectors. Accepted values are 'l1' and 'l2'.

If num_best is left unspecified, similarity queries will return a full vector with one float for every document in the index:

>>> index = Similarity('/path/to/index', corpus, num_features=400) # if corpus has 7 documents...
>>> index[query] # ... then result will have 7 floats
[0.0, 0.0, 0.2, 0.13, 0.8, 0.0, 0.1]

If num_best is set, queries return only the num_best most similar documents, always leaving out documents for which the similarity is 0. If the input vector itself only has features with zero values (=the sparse representation is empty), the returned list will always be empty.

>>> index.num_best = 3
>>> index[query] # return at most "num_best" of `(index_of_document, similarity)` tuples
[(4, 0.8), (2, 0.13), (3, 0.13)]

You can also override num_best dynamically, simply by setting e.g. self.num_best = 10 before doing a query.

add_documents(corpus)

Extend the index with new documents.

Internally, documents are buffered and then spilled to disk when there’s self.shardsize of them (or when a query is issued).
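A short sketch (index as constructed above; more_documents stands in for any additional gensim corpus):

>>> index.add_documents(more_documents)  # buffered; spilled to a new shard once shardsize documents accumulate
>>> sims = index[query]  # subsequent queries run against the extended index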

check_moved()

Update shard locations, in case the directory holding the index has been moved on the filesystem.

close_shard()

Force the latest shard to close (be converted to a matrix and stored to disk). Do nothing if no new documents added since last call.

NOTE: the shard is closed even if it is not full yet (its size is smaller than self.shardsize). If documents are added later via add_documents(), this incomplete shard will be loaded again and completed.

destroy()

Delete all files under self.output_prefix. The object is no longer usable after calling this method. Use with care!

iter_chunks(chunksize=None)

Iteratively yield the index as chunks of documents, each of size <= chunksize.

The chunk is returned in its raw form (matrix or sparse matrix slice). The size of the chunk may be smaller than requested; it is up to the caller to check the result for real length, using chunk.shape[0].
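A brief sketch of scanning the whole index in chunks (index as above; process is a placeholder for your own per-chunk logic):

>>> for chunk in index.iter_chunks(chunksize=1024):
...     n = chunk.shape[0]  # actual number of documents in this chunk; may be < 1024
...     process(chunk)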

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

query_shards(query)

Return the result of applying shard[query] for each shard in self.shards, as a sequence.

If PARALLEL_SHARDS is set, the shards are queried in parallel, using the multiprocessing module.
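PARALLEL_SHARDS is a module-level setting in gensim.similarities.docsim; a hedged sketch of enabling it (the assumption here is that its value gives the number of worker processes):

>>> import gensim.similarities.docsim as docsim
>>> docsim.PARALLEL_SHARDS = 4  # assumption: query shards in 4 processes via multiprocessing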

save(fname=None, *args, **kwargs)

Save the object via pickling (also see load) under the filename specified in the constructor.

Calls close_shard internally to spill any unfinished shards to disk first.

similarity_by_id(docpos)

Return similarity of the given document only. docpos is the position of the query document within the index.

vector_by_id(docpos)

Return indexed vector corresponding to the document at position docpos.
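A short sketch of both position-based accessors (index as above):

>>> sims = index.similarity_by_id(0)  # similarities of indexed document #0 against the whole index
>>> vec = index.vector_by_id(0)  # the stored vector for indexed document #0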

class gensim.similarities.docsim.SparseMatrixSimilarity(corpus, num_features=None, num_terms=None, num_docs=None, num_nnz=None, num_best=None, chunksize=500, dtype=numpy.float32, maintain_sparsity=False)

Compute similarity against a corpus of documents by storing the sparse index matrix in memory. The similarity measure used is cosine between two vectors.

Use this if your input corpus contains sparse vectors (such as documents in bag-of-words format) and fits into RAM.

The matrix is internally stored as a scipy.sparse.csr matrix. Unless the entire matrix fits into main memory, use Similarity instead.

Takes an optional maintain_sparsity argument; setting it to True causes get_similarities to return a sparse matrix instead of a dense representation, where possible.

See also Similarity and MatrixSimilarity in this module.
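A minimal sketch, assuming bow_corpus is a bag-of-words corpus built from a gensim dictionary (both names are placeholders):

>>> from gensim.similarities.docsim import SparseMatrixSimilarity
>>>
>>> index = SparseMatrixSimilarity(bow_corpus, num_features=len(dictionary))
>>> sims = index[query_bow]  # numpy array of cosine similarities
>>> # With maintain_sparsity=True at construction, batch results can stay scipy.sparse.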

get_similarities(query)

Return similarity of sparse vector query to all documents in the corpus, as a numpy array.

If query is a collection of documents, return a 2D array of similarities of each document in query to all documents in the corpus (=batch query, faster than processing each document in turn).

Do not use this function directly; use the self[query] syntax instead.

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.

class gensim.similarities.docsim.WmdSimilarity(corpus, w2v_model, num_best=None, normalize_w2v_and_replace=True, chunksize=256)

Document similarity (like MatrixSimilarity) that uses the negative of Word Mover's Distance (WMD) as a similarity measure. See gensim.models.word2vec.wmdistance for more information.

When a num_best value is provided, only the most similar documents are retrieved.

When using this code, please consider citing the following papers:

* Ofir Pele and Michael Werman, “A linear time histogram metric for improved SIFT matching”.
* Ofir Pele and Michael Werman, “Fast and robust earth mover’s distances”.
* Matt Kusner et al., “From Word Embeddings To Document Distances”.

Example:

# Given a document collection "corpus", train a word2vec model.
model = Word2Vec(corpus)
instance = WmdSimilarity(corpus, model, num_best=10)

# Make a query.
sims = instance[query]

corpus: List of lists of strings, as in gensim.models.word2vec.
w2v_model: A trained word2vec model.
num_best: Number of results to retrieve. If provided, a fast algorithm called “prefetch and prune” is used.
normalize_w2v_and_replace: Whether or not to normalize the word2vec vectors to length 1.
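A more self-contained sketch, using gensim's Word2Vec class on a tiny tokenized corpus (the toy data and the min_count setting are illustrative assumptions):

>>> from gensim.models import Word2Vec
>>> from gensim.similarities.docsim import WmdSimilarity
>>>
>>> corpus = [['human', 'interface', 'computer'],
...           ['survey', 'user', 'computer', 'system']]
>>> model = Word2Vec(corpus, min_count=1)  # train word2vec on the tokenized documents
>>> instance = WmdSimilarity(corpus, model, num_best=2)
>>> sims = instance[['human', 'computer', 'system']]  # the query is itself a tokenized document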
get_similarities(query)

Do not use this function directly; use the self[query] syntax instead.

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. Default: don’t use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.