Pipelines
To avoid data leakage and make topic models easier to work with, we recommend using scikit-learn's Pipeline.
Create a vectorizer and topic model:
from tweetopic import DMM
from sklearn.feature_extraction.text import CountVectorizer
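vectorizer = CountVectorizer(
    stop_words="english",
    max_df=0.3,  # ignore terms that appear in more than 30% of documents
    min_df=15,   # ignore terms that appear in fewer than 15 documents
)
dmm = DMM(
    n_components=15,   # number of topics (mixture components)
    n_iterations=200,  # number of sampling iterations
    alpha=0.1,
    beta=0.2,
)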
Add the two components to a scikit-learn pipeline:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ("vectorizer", vectorizer),
    ("dmm", dmm),
])
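To make the chaining concrete, here is a minimal pure-Python sketch of what fitting a two-step pipeline roughly amounts to. It is a simplified illustration, not scikit-learn's actual implementation; `TinyPipeline`, `ToyVectorizer`, and `ToyModel` are hypothetical names introduced only for this example.

```python
class TinyPipeline:
    """Simplified illustration of step chaining; not scikit-learn's code."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def fit(self, X):
        # Every step except the last is fitted, then transforms the data
        for _, step in self.steps[:-1]:
            X = step.fit(X).transform(X)
        # The final step (e.g. a topic model) is fitted on the transformed data
        self.steps[-1][1].fit(X)
        return self


class ToyVectorizer:
    """Hypothetical stand-in for a bag-of-words vectorizer."""

    def fit(self, texts):
        words = sorted({w for t in texts for w in t.split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, texts):
        rows = []
        for t in texts:
            row = [0] * len(self.vocabulary_)
            for w in t.split():
                if w in self.vocabulary_:  # unseen words are dropped
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows


class ToyModel:
    """Hypothetical stand-in for a topic model; only records the input width."""

    def fit(self, X):
        self.n_features_ = len(X[0])
        return self


toy = TinyPipeline([("vectorizer", ToyVectorizer()), ("model", ToyModel())])
toy.fit(["a b", "b c"])
```

Because the vectorizer is fitted inside the pipeline, its vocabulary is always learned from the same texts the model is fitted on, which is what guards against leakage.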
Fit the pipeline on an iterable of texts (here, texts is assumed to be a collection of strings):
pipeline.fit(texts)
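Once fitted, a pipeline can be applied to unseen texts in a single call. The sketch below assumes scikit-learn is installed and, to stay runnable without tweetopic, swaps in scikit-learn's LatentDirichletAllocation as a stand-in for DMM. The toy texts and the names `train_texts`, `new_texts`, and `demo_pipeline` are hypothetical.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, used only for illustration
train_texts = [
    "cats chase mice",
    "dogs chase cats",
    "mice eat cheese",
    "dogs eat bones",
]
new_texts = ["zebras chase dogs"]

demo_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    # Stand-in topic model; the document's own example uses tweetopic's DMM here
    ("topic_model", LatentDirichletAllocation(n_components=2, random_state=0)),
])
demo_pipeline.fit(train_texts)

# Vectorize and infer topic weights for unseen texts in one step
doc_topic = demo_pipeline.transform(new_texts)

# The vocabulary was learned from the training texts only
vocabulary = demo_pipeline.named_steps["vectorizer"].vocabulary_
```

The word "zebras" appears only in the unseen text, so it is absent from the learned vocabulary and is ignored at transform time. Nothing from the new data influences the fitted components.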