DMM

Usage guide

class tweetopic.dmm.DMM(n_components: int, n_iterations: int = 50, alpha: float = 0.1, beta: float = 0.1)

Implementation of the Dirichlet Multinomial Mixture Model (DMM) with a Gibbs sampling solver. The class aims for full compatibility with sklearn.

Parameters:
  • n_components (int) – Number of mixture components in the model.

  • n_iterations (int, default 50) – Number of Gibbs sampling iterations during fitting. If the results are unsatisfactory, try increasing this number.

  • alpha (float, default 0.1) – Willingness of a document to join an empty cluster.

  • beta (float, default 0.1) – Willingness of a document to join clusters in which its terms are not present.
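
As a rough illustration of how alpha and beta act, the sketch below evaluates the conditional distribution published in Yin and Wang (2014) for assigning one document to each cluster. It is a plain NumPy rendering of that formula, not tweetopic's internal code; the function names cluster_log_probs and normalise and the toy counts are made up for this example.

```python
import numpy as np


def cluster_log_probs(doc_counts, cluster_doc_count, components, alpha, beta):
    """Unnormalised log-probability of one document joining each cluster.

    All counts must exclude the document that is being assigned.
    """
    n_components, n_vocab = components.shape
    n_other_docs = cluster_doc_count.sum()
    cluster_sizes = components.sum(axis=1)
    # Cluster popularity term: alpha keeps empty clusters reachable.
    log_p = np.log(cluster_doc_count + alpha) - np.log(
        n_other_docs + n_components * alpha
    )
    # Word similarity term: beta smooths words the cluster has not seen.
    for word in np.flatnonzero(doc_counts):
        for j in range(int(doc_counts[word])):
            log_p = log_p + np.log(components[:, word] + beta + j)
    for i in range(int(doc_counts.sum())):
        log_p = log_p - np.log(cluster_sizes + n_vocab * beta + i)
    return log_p


def normalise(log_p):
    """Turn unnormalised log-probabilities into probabilities."""
    p = np.exp(log_p - log_p.max())
    return p / p.sum()


# Hypothetical state: cluster 0 holds 3 documents, cluster 1 is empty.
cluster_doc_count = np.array([3, 0])
components = np.array([[5, 5, 0], [0, 0, 0]])
doc_counts = np.array([1, 1, 0])

p_low = normalise(
    cluster_log_probs(doc_counts, cluster_doc_count, components, alpha=0.1, beta=0.1)
)
p_high = normalise(
    cluster_log_probs(doc_counts, cluster_doc_count, components, alpha=1.0, beta=0.1)
)
```

In this toy setting, raising alpha from 0.1 to 1.0 increases the probability assigned to the empty cluster (p_high[1] is larger than p_low[1]), which is the behaviour the alpha description suggests.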

components_

Describes all components of the topic distribution. Contains the number of times each word has been assigned to each component during fitting.

Type:

array of shape (n_components, n_vocab)
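
Since components_ holds word counts per component, the most characteristic words of a component can be read off by sorting each row. The sketch below uses a made-up count matrix and vocabulary in place of a fitted model; top_words is a hypothetical helper, not part of tweetopic.

```python
import numpy as np

# Stand-in for a fitted model's components_ (2 components, 4 vocabulary items).
components = np.array([
    [10, 0, 3, 1],
    [0, 8, 1, 6],
])
# Stand-in for the vocabulary, in the same column order as the matrix.
vocab = np.array(["cat", "stock", "dog", "market"])


def top_words(components, vocab, n_top=2):
    """Return the n_top highest-count words for every component."""
    order = np.argsort(-components, axis=1)[:, :n_top]
    return [vocab[row].tolist() for row in order]


words = top_words(components, vocab)
```

With a real model, vocab would typically come from the vectorizer that produced the bag-of-words matrix.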

cluster_doc_count

Array containing how many documents there are in each cluster.

Type:

array of shape (n_components,)

n_features_in_

Total number of vocabulary items seen during fitting.

Type:

int

n_documents

Total number of documents seen during fitting.

Type:

int

max_unique_words

Maximum number of unique words in a document seen during fitting.

Type:

int
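
To make the three size attributes above concrete, the following sketch computes the corresponding quantities from a small dense bag-of-words matrix. It assumes the attributes are derived directly from the matrix passed to fit(), as their descriptions suggest; the matrix itself is invented for illustration.

```python
import numpy as np

# Hypothetical bag-of-words matrix: 3 documents, 4 vocabulary items.
X = np.array([
    [2, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
])

# Rows are documents, columns are vocabulary items.
n_documents, n_features_in = X.shape
# A word counts once per document, no matter how often it occurs.
unique_words_per_doc = (X > 0).sum(axis=1)
max_unique_words = int(unique_words_per_doc.max())
```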

get_params(deep: bool = False) → dict

Get parameters for this estimator.

Parameters:

deep (bool, default False) – Ignored, exists for sklearn compatibility.

Returns:

Parameter names mapped to their values.

Return type:

dict

Note

Exists for sklearn compatibility.

set_params(**params) → DMM

Set parameters for this estimator.

Returns:

Estimator instance

Return type:

DMM

Note

Exists for sklearn compatibility.

fit(X: spmatrix | ArrayLike, y: None = None)

Fits the model using Gibbs sampling. A detailed description of the algorithm is given in Yin and Wang (2014).

Parameters:
  • X (array-like or sparse matrix of shape (n_samples, n_features)) – Bag-of-words (document-term) matrix of the corpus.

  • y (None) – Ignored, exists for sklearn compatibility.

Returns:

The fitted model.

Return type:

DMM

Note

fit() modifies the model in place; the fitted model is returned for convenience.

transform(X: spmatrix | ArrayLike) → ndarray

Predicts probabilities for each document belonging to each component.

Parameters:

X (array-like or sparse matrix of shape (n_samples, n_features)) – Document-term matrix.

Returns:

Probabilities for each document belonging to each cluster.

Return type:

array of shape (n_samples, n_components)

Raises:

NotFittedException – If the model has not been fitted.

predict_proba(X: spmatrix | ArrayLike) → ndarray

Alias of transform().

Mainly exists for compatibility with density estimators in sklearn.

predict(X: spmatrix | ArrayLike) → ndarray

Predicts cluster labels for a set of documents. Mainly exists for compatibility with density estimators in sklearn.

Parameters:

X (array-like or sparse matrix of shape (n_samples, n_features)) – Document-term matrix.

Returns:

Cluster label for each document.

Return type:

array of shape (n_samples,)

Raises:

NotFittedException – If the model has not been fitted.
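
The relationship between transform() and predict() can be sketched with plain NumPy. The snippet assumes that predict() returns the most probable component for each document, which is the usual convention but is not stated explicitly above; the probability matrix is a made-up stand-in for the output of transform().

```python
import numpy as np

# Hypothetical transform() output for 3 documents and 2 components.
probabilities = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
])

# Assumed behaviour of predict(): index of the largest probability per row.
labels = probabilities.argmax(axis=1)
```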

fit_transform(X: spmatrix | ArrayLike, y: None = None) → ndarray

Fits the model, then transforms the given data.

Parameters:
  • X (array-like or sparse matrix of shape (n_samples, n_features)) – Document-term matrix.

  • y (None) – Ignored, exists for sklearn compatibility.

Returns:

Probabilities for each document belonging to each cluster.

Return type:

array of shape (n_samples, n_components)