Pipelines

To avoid data leakage and make topic models easier to work with, we recommend using scikit-learn’s Pipeline.

Create a vectorizer and topic model:

from tweetopic import DMM
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    stop_words="english",
    max_df=0.3,
    min_df=15,
)
dmm = DMM(
    n_components=15,
    n_iterations=200,
    alpha=0.1,
    beta=0.2,
)

Add the two components to a scikit-learn pipeline:

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("vectorizer", vectorizer),
    ("dmm", dmm)
])

Fit the pipeline on an iterable of texts:

pipeline.fit(texts)
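
To illustrate why a pipeline helps against data leakage, here is a minimal sketch of a fitted pipeline applied to unseen texts. It makes several assumptions: the toy corpus and the names train_texts and new_texts are hypothetical, and scikit-learn's LatentDirichletAllocation stands in for DMM so that the example runs with scikit-learn alone. It is not the tweetopic model, and its output will differ from what DMM produces.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; a real corpus would be far larger.
train_texts = [
    "cats chase mice in the barn",
    "dogs chase cats in the yard",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
]
new_texts = ["dogs and cats", "stocks and bonds"]

# Assumption: LatentDirichletAllocation is a stand-in for DMM here.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer(stop_words="english")),
    ("topic_model", LatentDirichletAllocation(n_components=2, random_state=0)),
])
pipeline.fit(train_texts)

# The vocabulary was learned from train_texts only, so words that
# appear only in new_texts (such as "bonds") are not part of it.
vocabulary = pipeline.named_steps["vectorizer"].vocabulary_

# transform() reuses the fitted vocabulary and topic model; nothing
# is re-estimated from the unseen texts.
doc_topic = pipeline.transform(new_texts)  # one row per text, one column per topic
```

Because both steps are fitted together in a single fit call, there is no opportunity to fit the vectorizer on texts that the topic model is later evaluated on.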

Note

It is highly advisable to pre-process texts with an NLP library such as spaCy or NLTK. Removing stop words and function words and lemmatizing the remaining tokens can drastically improve the quality of topics.
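
One way to wire such pre-processing into the vectorizer is CountVectorizer's tokenizer parameter. The sketch below is only an illustration under stated assumptions: TOY_LEMMAS and lemma_tokenizer are hypothetical names, and the lookup table is a placeholder for a real lemmatizer from spaCy or NLTK.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical lookup table standing in for a real lemmatizer.
TOY_LEMMAS = {"cats": "cat", "dogs": "dog", "chased": "chase", "chasing": "chase"}


def lemma_tokenizer(text):
    """Split text into word tokens and map each one to its toy lemma."""
    tokens = re.findall(r"\b\w\w+\b", text)
    return [TOY_LEMMAS.get(token, token) for token in tokens]


vectorizer = CountVectorizer(
    tokenizer=lemma_tokenizer,
    token_pattern=None,  # unused when a custom tokenizer is supplied
    stop_words="english",
)
counts = vectorizer.fit_transform([
    "The cats chased a dog",
    "Dogs are chasing the cat",
])

# Inflected forms collapse onto shared lemmas before counting.
features = sorted(vectorizer.vocabulary_)
```

The vectorizer lowercases each text before calling the tokenizer and applies the stop-word filter afterwards, so the tokenizer only needs to handle splitting and lemmatization.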