DMM
- class tweetopic.dmm.DMM(n_components: int, n_iterations: int = 50, alpha: float = 0.1, beta: float = 0.1)
Implementation of the Dirichlet Mixture Model with Gibbs Sampling solver. The class aims to achieve full compatibility with sklearn.
- Parameters:
n_components (int) – Number of mixture components in the model.
n_iterations (int, default 50) – Number of iterations during fitting. If the results are unsatisfactory, increase this number.
alpha (float, default 0.1) – Willingness of a document to join an empty cluster.
beta (float, default 0.1) – Willingness of a document to join a cluster in which its terms are not present.
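To make the roles of alpha and beta concrete, here is a minimal pure-Python sketch of the cluster-assignment score used by the Gibbs sampler in Yin and Wang (2014). It is an illustration under stated assumptions, not tweetopic's actual code: the function name `log_cluster_score`, its arguments, and the toy counts are all hypothetical.

```python
import math


def log_cluster_score(doc, cluster_word_counts, cluster_doc_count,
                      n_documents, n_components, n_vocab, alpha, beta):
    """Unnormalised log-probability of `doc` joining one cluster.

    doc: dict mapping word id -> count in the document.
    cluster_word_counts: dict mapping word id -> count in the cluster,
        with the document itself excluded.
    cluster_doc_count: number of documents currently in the cluster.
    """
    # Popularity term: alpha keeps empty clusters (count 0) reachable.
    score = math.log(cluster_doc_count + alpha)
    score -= math.log(n_documents - 1 + n_components * alpha)
    # Similarity term: beta smooths words the cluster has not seen yet.
    for word, count in doc.items():
        in_cluster = cluster_word_counts.get(word, 0)
        for j in range(count):
            score += math.log(in_cluster + beta + j)
    cluster_size = sum(cluster_word_counts.values())
    doc_length = sum(doc.values())
    for i in range(doc_length):
        score -= math.log(cluster_size + n_vocab * beta + i)
    return score


# Hypothetical toy data (not produced by tweetopic).
doc = {0: 2, 3: 1}             # word id -> count
matching = {0: 5, 3: 4, 7: 1}  # cluster sharing the document's words
unrelated = {1: 6, 2: 4}       # cluster with none of its words
shared = dict(n_documents=10, n_components=3, n_vocab=8)

gaps = {}
for beta in (0.01, 1.0):
    gaps[beta] = (
        log_cluster_score(doc, matching, 4, alpha=0.1, beta=beta, **shared)
        - log_cluster_score(doc, unrelated, 4, alpha=0.1, beta=beta, **shared)
    )
    print(f"beta={beta}: log-score gap = {gaps[beta]:.2f}")
```

In this sketch, a larger beta shrinks the gap between the matching and the unrelated cluster, which is what "willingness to join clusters where the terms are not present" means in practice. A larger alpha likewise raises the score of a cluster that currently holds no documents.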
- components_
Describes all components of the topic distribution. Contains the number of times each word has been assigned to each component during fitting.
- Type:
array of shape (n_components, n_vocab)
- cluster_doc_count
Array containing the number of documents in each cluster.
- Type:
array of shape (n_components,)
- n_features_in_
Number of total vocabulary items seen during fitting.
- Type:
int
- n_documents
Total number of documents seen during fitting.
- Type:
int
- max_unique_words
Maximum number of unique words in a document seen during fitting.
- Type:
int
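Because components_ holds per-component word counts, one common way to inspect a fitted model is to list the highest-count words of each component. The sketch below uses a small hypothetical list of lists in place of the real array, and a hypothetical `vocabulary` list in place of the feature names a vectorizer would supply; `top_words` is an illustrative helper, not part of tweetopic.

```python
def top_words(components, vocabulary, n_top=3):
    """Return the n_top highest-count words for every component."""
    result = []
    for counts in components:
        # Word indices sorted by descending count within this component.
        ranked = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
        result.append([vocabulary[i] for i in ranked[:n_top]])
    return result


# Hypothetical stand-in for `components_`, shape (n_components, n_vocab).
components = [
    [9, 0, 4, 1],
    [0, 7, 1, 6],
]
vocabulary = ["cat", "stock", "dog", "market"]

print(top_words(components, vocabulary, n_top=2))
```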
- get_params(deep: bool = False) → dict
Get parameters for this estimator.
- Parameters:
deep (bool, default False) – Ignored, exists for sklearn compatibility.
- Returns:
Parameter names mapped to their values.
- Return type:
dict
Note
Exists for sklearn compatibility.
- set_params(**params) → DMM
Set parameters for this estimator.
- Returns:
Estimator instance.
- Return type:
DMM
Note
Exists for sklearn compatibility.
- fit(X: spmatrix | ArrayLike, y: None = None)
Fits the model using Gibbs sampling. The algorithm is described in detail in Yin and Wang (2014).
- Parameters:
X (array-like or sparse matrix of shape (n_samples, n_features)) – Bag-of-words matrix of the corpus.
y (None) – Ignored, exists for sklearn compatibility.
- Returns:
The fitted model.
- Return type:
DMM
Note
fit() also works in place; the fitted model is returned for convenience.
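A typical fitting workflow can be sketched as follows. This assumes scikit-learn's `CountVectorizer` is used to build the bag-of-words matrix. The helper name `fit_dmm` is hypothetical, and the constructor arguments simply repeat the defaults from the class signature.

```python
def fit_dmm(texts, n_components=5):
    """Vectorize raw texts and fit a DMM on the resulting counts."""
    # Imported lazily so the sketch can be read without the packages installed.
    from sklearn.feature_extraction.text import CountVectorizer
    from tweetopic.dmm import DMM

    vectorizer = CountVectorizer()
    doc_term_matrix = vectorizer.fit_transform(texts)  # sparse BOW matrix
    model = DMM(n_components=n_components, n_iterations=50, alpha=0.1, beta=0.1)
    model.fit(doc_term_matrix)  # fits in place and also returns the model
    return vectorizer, model
```

Calling `fit_dmm(texts, n_components=2)` on a list of short texts would return the fitted vectorizer together with the fitted model, so new documents can later be vectorized with the same vocabulary before being passed to the model.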
- transform(X: spmatrix | ArrayLike) → ndarray
Predicts probabilities for each document belonging to each component.
- Parameters:
X (array-like or sparse matrix of shape (n_samples, n_features)) – Document-term matrix.
- Returns:
Probabilities for each document belonging to each cluster.
- Return type:
array of shape (n_samples, n_components)
- Raises:
NotFittedException – If the model is not fitted.
- predict_proba(X: spmatrix | ArrayLike) → ndarray
Alias of transform(). Mainly exists for compatibility with density estimators in sklearn.
- predict(X: spmatrix | ArrayLike) → ndarray
Predicts cluster labels for a set of documents. Mainly exists for compatibility with density estimators in sklearn.
- Parameters:
X (array-like or sparse matrix of shape (n_samples, n_features)) – Document-term matrix.
- Returns:
Cluster label for each document.
- Return type:
array of shape (n_samples,)
- Raises:
NotFittedException – If the model is not fitted.
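The relationship between the probability matrix and the cluster labels can be illustrated with a small sketch. It assumes that each label is the index of the most probable component for that document, which this reference does not state explicitly. The helper `labels_from_probabilities` and the sample matrix are hypothetical.

```python
def labels_from_probabilities(probabilities):
    """Pick the index of the largest probability in every row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


# Hypothetical probability matrix, shape (n_samples, n_components).
probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.6, 0.1],
]
print(labels_from_probabilities(probabilities))
```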
- fit_transform(X: spmatrix | ArrayLike, y: None = None) → ndarray
Fits the model, then transforms the given data.
- Parameters:
X (array-like or sparse matrix of shape (n_samples, n_features)) – Document-term matrix.
y (None) – Ignored, exists for sklearn compatibility.
- Returns:
Probabilities for each document belonging to each cluster.
- Return type:
array of shape (n_samples, n_components)