
simserver – Document similarity server

A “find similar” service that uses gensim (vector space models) as its backend.

The server performs 3 main functions:

  1. converts documents to semantic representation (TF-IDF, LSA, LDA...)
  2. indexes documents in the vector representation, for faster retrieval
  3. for a given query document, returns the ids of the most similar documents from the index

SessionServer objects are transactional, so you can roll back or commit an entire set of changes.

The server is ready for concurrent requests (thread-safe). Indexing is incremental, and you can query the SessionServer even while it is being updated, so there is virtually no downtime.
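
For orientation, here is a minimal end-to-end sketch of those three steps, loosely following the package tutorial. The storage path and the toy corpus are placeholders; documents are plain dicts carrying an 'id' and a list of 'tokens', and 'lsi' is one of the model types listed under SimModel below:

    from simserver import SessionServer

    # All server state lives under this directory; an existing server there is resumed.
    server = SessionServer('/tmp/my_similarity_server')

    # Toy corpus: each document is a dict with an 'id' and a list of 'tokens'.
    texts = [
        'Human machine interface for lab abc computer applications',
        'A survey of user opinion of computer system response time',
        'Graph minors a survey',
    ]
    corpus = [{'id': 'doc_%i' % num, 'tokens': text.lower().split()}
              for num, text in enumerate(texts)]

    server.train(corpus, method='lsi')   # 1. build a semantic model (here: LSA)
    server.index(corpus)                 # 2. index the documents in that representation
    print(server.find_similar('doc_0'))  # 3. query by the id of an indexed document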

class simserver.simserver.SessionServer(basedir, autosession=True, use_locks=True)

Similarity server on top of SimServer that implements sessions = transactions.

A transaction is a set of server modifications (index/delete/train calls) that may be either committed or rolled back entirely.

Sessions are realized by:

  1. cloning (=copying) a SimServer at the beginning of a session
  2. serving read-only queries from the original server (the clone may be modified during queries)
  3. modifications affect only the clone
  4. at commit, the clone becomes the original
  5. at rollback, do nothing (clone is discarded, next transaction starts from the original again)
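
The sketch below shows the explicit transaction flow with autosession turned off. It assumes a server whose model has already been trained (see SimServer.train below); more_docs is a placeholder batch of document dicts:

    from simserver import SessionServer

    # With autosession off, modifications stay invisible to queries until commit().
    server = SessionServer('/tmp/my_similarity_server', autosession=False)

    more_docs = [{'id': 'extra_doc', 'tokens': ['some', 'new', 'content']}]

    server.open_session()
    server.index(more_docs)  # modifications affect only the session clone
    server.rollback()        # discard the clone; the original index is untouched

    server.open_session()
    server.index(more_docs)
    server.commit()          # the clone replaces the original; changes become visible
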
buffer(*args, **kwargs)

Buffer documents in the current session.

check_session(*args, **kwargs)

Make sure a session is open.

If no session is open and autosession is turned on, create a new session automatically. If no session is open and autosession is off, raise an exception.

close(*args, **kwargs)

Don’t wait for garbage collection; try to release important resources manually.

commit(*args, **kwargs)

Commit changes made by the latest session.

delete(*args, **kwargs)

Delete documents from the current session.

drop_index(*args, **kwargs)

Drop all indexed documents from the session. Optionally, drop model too.

find_similar(*args, **kwargs)

Find similar articles.

With autosession off, use the index state before current session started, so that changes made in the session will not be visible here. With autosession on, close the current session first (so that session changes are committed and visible).

index(*args, **kwargs)

Index documents in the current session.

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. Default: don’t use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), mmap must be None; load will raise an IOError otherwise.

open_session(*args, **kwargs)

Open a new session to modify this server.

You can either call this function directly, or turn on autosession, which will open/commit sessions for you transparently.

optimize(*args, **kwargs)

Optimize index for faster by-document-id queries.

rollback(*args, **kwargs)

Ignore all changes made in the latest session (terminate the session).

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If a file handle is passed, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.

set_autosession(*args, **kwargs)

Turn autosession (automatic committing after each modification call) on/off. If value is None, only query the current value (don’t change anything).

terminate(*args, **kwargs)

Delete all files created by this server, invalidating self. Use with care.

train(*args, **kwargs)

Update the semantic model in the current session.

class simserver.simserver.SimIndex(fname, num_features, shardsize=65536, topsims=100)

An index of documents. Used internally by SimServer.

It uses the Similarity class to persist all document vectors to disk (via mmap).

Spill index shards to disk after every shardsize documents. In similarity queries, return only the topsims most similar documents.
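
SimIndex instances are normally created and managed by SimServer; the sketch below merely illustrates the constructor parameters documented here. The path is a placeholder and num_features=400 is an arbitrary example value that must match the dimensionality of the vectors produced by the semantic model in use:

    from simserver.simserver import SimIndex

    index = SimIndex('/tmp/my_server/index', num_features=400,
                     shardsize=65536, topsims=100)
    index.close()  # release file handles explicitly (see close() below)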

close()

Explicitly release important resources (file handles, db, ...)

delete(docids)

Delete documents (specified by their ids) from the index.

index_documents(fresh_docs, model)

Update fresh index with new documents (potentially replacing old ones with the same id). fresh_docs is a dictionary-like object (=dict, sqlitedict, shelve etc) that maps document_id->document.

index_vectors(vectors)

Update fresh index with new vectors. vectors is a dictionary-like object (=dict, sqlitedict, shelve etc) that maps document_id->vector.

merge(other)

Merge documents from the other index. Update precomputed similarities in the process.

sims2scores(sims, eps=1e-07)

Convert raw similarity vector to a list of (docid, similarity) results.

sims_by_id(docid)

Find the most similar documents to the (already indexed) document with docid.

sims_by_vec(vec, normalize=None)

Find the most similar documents to a given vector (=already processed document).

terminate()

Delete all files created by this index, invalidating self. Use with care.

update_ids(docids)

Update id->pos mapping with new document ids.

update_mappings()

Synchronize id<->position mappings.

vec_by_id(docid)

Return indexed vector corresponding to document docid.

class simserver.simserver.SimModel(fresh_docs, dictionary=None, method=None, params=None)

A semantic model responsible for translating between plain text and (semantic) vectors.

These vectors can then be indexed/queried for similarity, see the SimIndex class. Used internally by SimServer.

Train a model, using fresh_docs as training corpus.

If dictionary is not specified, it is computed from the documents.

method is currently one of “tfidf”, “lsi”, “lda”.
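
A minimal training sketch, assuming fresh_docs is a plain dict mapping document ids to the same document dicts used elsewhere on this page (an sqlitedict or shelve object would work the same way):

    from simserver.simserver import SimModel

    fresh_docs = {
        'doc_0': {'id': 'doc_0', 'tokens': ['human', 'computer', 'interface']},
        'doc_1': {'id': 'doc_1', 'tokens': ['graph', 'minors', 'survey']},
    }

    model = SimModel(fresh_docs, method='lsi')  # or 'tfidf' / 'lda'
    vec = model.doc2vec({'tokens': ['computer', 'survey']})  # one document -> vector
    model.close()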

close()

Release important resources manually.

doc2vec(doc)

Convert a single SimilarityDocument to vector.

docs2vecs(docs)

Convert multiple SimilarityDocuments to vectors (batch version of doc2vec).

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. Default: don’t use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), mmap must be None; load will raise an IOError otherwise.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If a file handle is passed, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.
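
A sketch of the save/load round trip described above, using a model such as the one trained in the earlier SimModel sketch; the path is a placeholder:

    from simserver.simserver import SimModel

    # Large numpy/scipy.sparse arrays are detected automatically and written
    # to separate files next to '/tmp/my_simmodel'.
    model.save('/tmp/my_simmodel')

    # Later: load it back, memory-mapping the separately stored arrays read-only.
    model = SimModel.load('/tmp/my_simmodel', mmap='r')
    # mmap must stay None when loading from a compressed '.gz'/'.bz2' file.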

class simserver.simserver.SimServer(basename, use_locks=False)

Top-level functionality for similarity services. A similarity server takes care of:

  1. creating semantic models
  2. indexing documents using these models
  3. finding the most similar documents in an index.

An object of this class can be shared across the network via Pyro to answer remote client requests. It is thread-safe. Using a server concurrently from multiple processes is safe for reading (answering similarity queries); modifications (training/indexing) are serialized internally via locking.

All data will be stored under directory basename. If there is a server there already, it will be loaded (resumed).

The server object is stateless in RAM – its state is defined entirely by its location. There is therefore no need to store the server object.
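
As a rough deployment sketch (not the package's own startup script), the server object could be registered with a Pyro4 daemon; here the transactional SessionServer, which wraps SimServer, is used. Recent Pyro4 versions additionally require marking the class as exposed:

    import Pyro4
    from simserver import SessionServer

    Pyro4.expose(SessionServer)  # recent Pyro4 versions require explicit exposure
    server = SessionServer('/tmp/my_similarity_server')

    daemon = Pyro4.Daemon()
    uri = daemon.register(server, objectId='gensim.testserver')
    print('similarity server available at', uri)
    daemon.requestLoop()  # serve remote client requests until interrupted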

buffer(*args, **kwargs)

Add a sequence of documents to be processed (indexed or trained on).

Here, the documents are simply collected; real processing is done later, during the self.index or self.train calls.

buffer can be called repeatedly; the result is the same as if it was called once with a concatenation of all the partial document batches. The point is to save memory when sending large corpora over the network: instead of serializing all documents into RAM at once, they can be sent in smaller chunks. See utils.upload_chunked().

A call to flush() clears this documents-to-be-processed buffer (flush is also implicitly called when you call index() and train()).
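
For example, a large corpus can be pushed to the server in batches; the chunks() helper below is only an illustrative stand-in for utils.upload_chunked(), and server/corpus are the objects from the introductory sketch:

    def chunks(iterable, chunksize=1000):
        """Yield successive lists of at most chunksize documents."""
        batch = []
        for doc in iterable:
            batch.append(doc)
            if len(batch) == chunksize:
                yield batch
                batch = []
        if batch:
            yield batch

    for batch in chunks(corpus, chunksize=1000):
        server.buffer(batch)  # only collects the documents
    server.index()            # process and index everything buffered so far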

close()

Explicitly close open file handles, databases etc.

delete(*args, **kwargs)

Delete specified documents from the index.

drop_index(*args, **kwargs)

Drop all indexed documents. If keep_model is False, also drop the model.

find_similar(doc, min_score=0.0, max_results=100)

Find max_results most similar articles in the index, each having similarity score of at least min_score. The resulting list may be shorter than max_results, in case there are not enough matching documents.

doc is either a string (=document id, previously indexed) or a dict containing a ‘tokens’ key. These tokens are processed to produce a vector, which is then used as a query against the index.

The similar documents are returned in decreasing similarity order, as (doc_id, similarity_score, doc_payload) 3-tuples. The payload returned is identical to what was supplied for this document during indexing.
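
Both query modes, continuing the introductory sketch (server and the 'doc_0' id come from there):

    # Query by the id of an already indexed document.
    print(server.find_similar('doc_0', min_score=0.3, max_results=10))

    # Or query by arbitrary text: pass a dict with a 'tokens' key.
    query = {'tokens': 'graph minors and computer interfaces'.lower().split()}
    for doc_id, score, payload in server.find_similar(query, min_score=0.3, max_results=10):
        print(doc_id, score, payload)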

flush(*args, **kwargs)

Commit all changes, clear all caches.

index(*args, **kwargs)

Permanently index all documents previously added via buffer, or directly index documents from corpus, if specified.

The indexing model must already exist (see train) before this function is called.

keys()

Return ids of all indexed documents.

optimize(*args, **kwargs)

Precompute top similarities for all indexed documents. This speeds up find_similar queries by id (but not queries by fulltext).

Internally, documents are moved from a fresh index (=no precomputed similarities) to an optimized index (precomputed similarities). Similarity queries always query both indexes, so this split is transparent to clients.

If you add documents later via index, they go to the fresh index again. To precompute top similarities for these new documents too, simply call optimize again.
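
In practice the pattern looks like this (server from the introductory sketch, more_docs a placeholder batch of new document dicts):

    server.optimize()        # precompute top similarities for everything indexed so far

    server.index(more_docs)  # new documents land in the fresh (unoptimized) index

    server.optimize()        # precompute top similarities for the new documents too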

train(*args, **kwargs)

Create an indexing model. Will overwrite the model if it already exists. All indexes become invalid, because documents in them use a now-obsolete representation.

The model is trained on documents previously entered via buffer, or directly on corpus, if specified.

simserver.simserver.merge_sims(oldsims, newsims, clip=None)

Merge two precomputed similarity lists, truncating the result to clip most similar items.
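
A small usage sketch, assuming each argument is a list of (document_id, similarity) pairs such as those produced by SimIndex.sims2scores above:

    from simserver.simserver import merge_sims

    old = [('doc_0', 0.9), ('doc_2', 0.4)]
    new = [('doc_5', 0.7), ('doc_1', 0.2)]

    # Keep only the 3 most similar items across both lists.
    print(merge_sims(old, new, clip=3))  # expected: the three highest-scoring pairs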