Dashboards#

If you have multiple embedding models, or you want to explore the same embeddings with different tools, while still keeping everything in a single web application, dashboards are here to help.

Dashboards are made up of a list of cards. Each card represents a page in the application.

Let’s say, for example, that you want to examine the semantic relations in a corpus from multiple perspectives. You could have a word-level GloVe model with a semantic network app to go along with it. You also want a document-level clustering/projection app, but you can’t decide whether to use tf-idf representations or paragraph embeddings (Doc2Vec).

You can include all of these in a dashboard as cards. Let’s build all of this from scratch.

We will need gensim and glovpy, so let’s install those:

pip install glovpy gensim

First we load the 20 Newsgroups dataset:

from sklearn.datasets import fetch_20newsgroups

# Loading the dataset
newsgroups = fetch_20newsgroups(
    remove=("headers", "footers", "quotes"),
)
corpus = newsgroups.data
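Stripping headers, footers and quotes can leave some posts completely empty; filtering those out before training is an optional but sensible precaution (this step is not required by embedding-explorer). Demonstrated on a toy list for illustration:

```python
# Toy example: drop documents that are empty (or whitespace-only)
# after header/footer/quote removal
docs = ["A real post.", "", "   \n", "Another real post."]
docs = [text for text in docs if text.strip()]
print(len(docs))  # 2 documents remain
```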

Let’s import all card types and initialize our cards to be an empty list:

from embedding_explorer.cards import NetworkCard, ClusteringCard

cards = []

Then let’s train a word embedding model and add it as a card to the dashboard.

from glovpy import GloVe
from gensim.utils import tokenize

# Tokenizing the dataset
tokenized_corpus = [
    list(tokenize(text, lower=True, deacc=True)) for text in corpus
]
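gensim’s tokenize yields lowercased, de-accented word tokens when called with `lower=True, deacc=True`. Its behavior is roughly equivalent to this pure-Python sketch (an illustrative approximation, not gensim’s actual implementation):

```python
import re
import unicodedata

def simple_tokenize(text):
    # Approximate deacc=True: decompose accented characters and
    # drop the combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Approximate lower=True plus tokenization: lowercase the text
    # and keep runs of letters as tokens
    return re.findall(r"[a-z]+", text.lower())

simple_tokenize("Voilà, GloVe embeddings!")
# ['voila', 'glove', 'embeddings']
```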
# Training word embeddings
model = GloVe(vector_size=25)
model.train(tokenized_corpus)
# Adding a Semantic Network card to the dashboard
vocabulary = model.wv.index_to_key
embeddings = model.wv.vectors
cards.append(NetworkCard("GloVe Semantic Networks", corpus=vocabulary, embeddings=embeddings))

Next let’s extract tf-idf representations of documents and add a clustering card to our cards.

from sklearn.feature_extraction.text import TfidfVectorizer

# We are going to filter out stop words and all terms that occur in less than 10 documents.
embeddings = TfidfVectorizer(stop_words="english", min_df=10).fit_transform(corpus)
cards.append(ClusteringCard("tf-idf Clustering and Projection", embeddings=embeddings))
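Here `min_df=10` means that only terms occurring in at least 10 documents are kept in the vocabulary. The effect of such a document-frequency cutoff can be sketched in plain Python (a hypothetical helper for illustration, not part of scikit-learn):

```python
from collections import Counter

def terms_with_min_df(tokenized_docs, min_df):
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))
    # Keep only terms that occur in at least min_df documents
    return {term for term, count in df.items() if count >= min_df}

docs = [["apple", "pie"], ["apple", "cake"], ["banana"]]
terms_with_min_df(docs, min_df=2)  # {'apple'}
```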

For the last card, we are going to train Doc2Vec document representations.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged_corpus = [TaggedDocument(tokens, [i]) for i, tokens in enumerate(tokenized_corpus)]
model = Doc2Vec(tagged_corpus)
embeddings = model.dv.vectors
cards.append(ClusteringCard("Doc2Vec Clustering and Projection", embeddings=embeddings))
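Each TaggedDocument is essentially a (words, tags) pair, where the tag uniquely identifies the document (here, its index) so that Doc2Vec can look up its vector after training. A minimal stand-in illustrating the structure (not gensim’s actual class):

```python
from collections import namedtuple

# Minimal stand-in for gensim's TaggedDocument: a (words, tags) pair,
# where tags uniquely identify the document (here, its index)
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

tokenized = [["hello", "world"], ["goodbye", "world"]]
tagged = [TaggedDoc(tokens, [i]) for i, tokens in enumerate(tokenized)]
tagged[1].words, tagged[1].tags  # (['goodbye', 'world'], [1])
```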

Then let’s start the dashboard.

from embedding_explorer import show_dashboard

show_dashboard(cards)
(Figure: Dashboard.)

API Reference#

embedding_explorer.show_dashboard(cards: List[Card], port: int = 8050) → Thread | None#

Show a dashboard for all the given model cards.

Parameters:
  • cards (list of Card) – Descriptions of the model cards that should appear in the dashboard.

  • port (int) – Port for the app to run on.

Returns:

If the app runs in a Jupyter notebook, it is started on a background thread, and this thread is returned.

Return type:

Thread or None

class embedding_explorer.cards.NetworkCard(name: str, corpus: Iterable[str], vectorizer: BaseEstimator | None = None, embeddings: ndarray | None = None, fuzzy_search: bool = False)#

Contains information about an embedding model card that should be displayed on the dashboard. This card will display the semantic network app when clicked on.

Parameters:
  • corpus (iterable of string) – Texts you intend to search in with the semantic explorer.

  • vectorizer (Transformer or None, default None) – Model to vectorize texts with. If not supplied the model is assumed to be a static word embedding model, and the embeddings parameter has to be supplied.

  • embeddings (ndarray of shape (n_corpus, n_features), default None) – Embeddings of the texts in the corpus. If not supplied, embeddings will be calculated using the vectorizer.

  • fuzzy_search (bool, default False) – Specifies whether you want fuzzy search over the vocabulary. This is recommended for production use, but the index takes time to build, so startup will be slower.

class embedding_explorer.cards.ClusteringCard(name: str, corpus: Iterable[str] | None = None, vectorizer: BaseEstimator | None = None, embeddings: ndarray | None = None, metadata: DataFrame | None = None, hover_name: str | None = None, hover_data: Any | None = None)#

Contains information about an embedding model card that should be displayed on the dashboard. This card will display the clustering app when clicked on.

Parameters:
  • corpus (iterable of string, optional) – Texts you intend to cluster.

  • vectorizer (TransformerMixin, optional) – Model to vectorize texts with.

  • embeddings (ndarray of shape (n_corpus, n_features), optional) – Embeddings of the texts in the corpus. If not supplied, texts will be encoded with the vectorizer.

  • metadata (DataFrame, optional) – Metadata about the corpus or the embeddings. This is useful for filtering data points or changing visual properties of the main figure.

  • hover_name (str, optional) – Title to display when hovering on a data point. Has to be the name of a column in the metadata.

  • hover_data (list[str] or dict[str, bool], optional) – Additional data to display when hovering on a data point. Has to be a list of column names in the metadata, or a mapping of column names to booleans.