Session

What is a Session?

A Session brings together a fixed set of Documents and a fixed set of Embeddings. It is the basis for creating an Index.

Creating a Session performs two important preprocessing steps of the data:

  • all text and embedding data is mapped to a fixed vocabulary
  • tokens are normalized according to given rules

Establishing a fixed vocabulary (from Documents and Embeddings) allows the Vectorian to build data structures that are highly optimized for ensuing operations.

More details on token normalization are given in the following section.

Token and Tag Normalization

A small but important parameter during Session creation is the normalizers option (which is usually set to "default").

  • normalizing tokens on a string level, e.g. lowercasing all tokens
  • ignoring certain tokens from all subsequent operations
  • unifying or mapping token POS tags

It is in these settings that users declare whether two tokens like "the" and "The" should be regarded identical or not. If they are to be unified, the selection of embedding vectors depends on the configured embedding's sampling setting (see Embeddings).

The Default Normalization

The Vectorian's default normalization is applies two sets of operations.

On the text level:

  • all non-word characters are removed from tokens (e.g. "has-" becomes "has")
  • if tokens do not contain at least one letter, they are ignored

On the POS tag level:

  • tokens with POS tag "PROPN" are mapped to POS tag "NOUN"
  • tokens with POS tag "PUNCT" (i.e. punctuation) are ignored

The motivation for rewriting "PROPN" tags is that these often pose a problem for tag-weighted alignments due to their rather high inaccuracy.

LabSession

LabSession is a specialization of Session that is specifically geared towards use in Jupyter. It offers the following advantages over Session:

  • displays a progress bar widget during performing queries
  • Results know how to render themselves in Jupyter as HTML