Ignis: Iterative Topic Modelling Platform
ignis is an extensible platform that provides a common interface for creating and visualising topic models.
By default, it supports creating LDA models using Tomotopy (https://bab2min.github.io/tomotopy/) and visualising them using pyLDAvis (https://github.com/bmabey/pyLDAvis), but support for other models and frameworks can be written in as necessary.
API documentation is available at https://zechyw.github.io/ignis-tm/ignis/.
Installation
The library package is named ignis-tm on PyPI, so to use it in a project, first install the ignis-tm package:
pip install ignis-tm
After installation, import and use the library as ignis in your code:
import ignis
Demonstration/Development Environment Walkthrough
A full demonstration/development environment can be set up using Python 3.7 and pipenv.
Clone the repository
Start by cloning the repository and navigating to the root folder of the codebase:
git clone https://github.com/ZechyW/ignis-tm.git
cd ignis-tm
Install the project dependencies
Install pipenv and use it to install the other dependencies:
pip install pipenv
pipenv install --dev
The pipenv environment can then be activated from the codebase root:
pipenv shell
The pipenv environment must be activated before the demo Jupyter notebooks can be used.
Perform post-installation steps
The full demonstration setup includes a number of Jupyter plugins under its dev dependencies that could be useful for working with the sample notebooks.
With the demo environment activated, install and configure the plugins:
jupyter contrib nbextension install --user
jupyter nbextensions_configurator enable --user
You can then configure the Jupyter notebook extensions directly from the web-based Jupyter UI.
In particular, see https://neuralcoder.science/Black-Jupyter/ for a guide to setting up the Code Prettify extension using black.
The ExecuteTime extension is also useful for tracking cell execution times.
You will also need to download the spaCy en_core_web_sm package if you intend to perform lemmatisation on your data:
python -m spacy download en_core_web_sm
Run the sample notebooks
Once the installation is complete, you can start a Jupyter notebook instance (be sure to activate the pipenv environment first if it is not already active):
jupyter notebook
Then go through the self-documented Ignis Corpus and Ignis LDA notebooks to explore the BBC news dataset.
Other Notes
Random seeds and indeterminacy
N.B.: The behaviour described below should be fixed in Tomotopy >= 0.9.1, which uses a different random number generation scheme. Note that models created with Tomotopy < 0.9.1 might therefore differ from newer models even if the same seed is set.
Some dependencies that perform non-deterministic operations (e.g., Tomotopy, Gensim) may need PYTHONHASHSEED to be set in order to consistently reproduce results.
To be safe, PYTHONHASHSEED should be explicitly set where necessary.
If using a Conda environment, this can be done with:
conda env config vars set PYTHONHASHSEED=<seed>
For direct invocation:
PYTHONHASHSEED=<seed> python script.py
For Jupyter notebooks in a non-Conda environment, edit the Jupyter kernel.json to add an appropriate env key.
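As a rough illustration of the Jupyter case, the sketch below adds a PYTHONHASHSEED entry under the env key of a kernel spec that has already been parsed from kernel.json. The helper name with_hash_seed, the seed value, and the sample spec are all hypothetical; locating the real kernel.json for your kernel and writing the result back are left out.

```python
import copy
import json


def with_hash_seed(kernel_spec, seed):
    """Return a copy of a parsed kernel.json dict with PYTHONHASHSEED set.

    `with_hash_seed` is a hypothetical helper, not part of ignis or Jupyter.
    """
    updated = copy.deepcopy(kernel_spec)
    # Kernel specs accept an optional "env" mapping of environment variables;
    # environment variable values are stored as strings.
    updated.setdefault("env", {})["PYTHONHASHSEED"] = str(seed)
    return updated


# A made-up kernel spec standing in for the contents of a real kernel.json.
sample_spec = {
    "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
    "display_name": "Python 3",
    "language": "python",
}

patched_spec = with_hash_seed(sample_spec, 11399)
print(json.dumps(patched_spec, indent=2))
```

The helper returns a modified copy rather than editing the input in place, so the original spec can be kept as a backup before anything is written to disk.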
Miscellaneous notes on dependencies
The ipython and jedi packages are pinned to specific versions in the demo pipenv environment to ensure their compatibility with extensions and code completion within Jupyter notebooks; unfortunately, they break with later versions due to a lack of upstream updates.
Changes
- 1.6.5 (18 June 2021)
- 1.5.0 (1 June 2021)
  - General functionality update to match development version; enhancements and improvements across the board.
  - Updated demo walkthrough notebooks.
"""
.. include:: ../README.md
"""
import ignis.aurum
import ignis.corpus
import ignis.probat
from ignis.__version__ import __version__
Corpus = ignis.corpus.Corpus
load_corpus = ignis.corpus.load_corpus
load_slice = ignis.corpus.load_slice
train_model = ignis.probat.train_model
suggest_num_topics = ignis.probat.suggest_num_topics
load_results = ignis.aurum.load_results
# Documentation
__pdoc__ = {"tests": False}
__all__ = [
    "load_corpus",
    "load_slice",
    "train_model",
    "suggest_num_topics",
    "load_results",
]
Sub-modules
ignis.aurum
- Aurum instances manage the results of topic modelling runs, and provide methods for exploring and iterating over them.
ignis.corpus
- Corpus and CorpusSlice instances are containers for tracking the Document objects in a dataset.
ignis.labeller
- Classes for working with automated topic labellers. End-users should not need to use these classes directly.
ignis.models
- Classes for working with topic modelling algorithms. End-users should not need to use these classes directly.
ignis.probat
- Functions for performing topic modelling to get Aurum results.
ignis.util
- General utility classes that are not directly part of the topic modelling functionality provided by ignis.
ignis.vis
- Classes for working with topic modelling result visualisations. End-users should not need to use these classes directly.
Functions
def load_corpus(filename)
- Loads a Corpus object from the given file.
Parameters
filename : str or pathlib.Path
- The file to load the Corpus object from.
Returns
ignis.corpus.Corpus
def load_corpus(filename):
    """
    Loads a `ignis.corpus.Corpus` object from the given file.

    Parameters
    ----------
    filename: str or pathlib.Path
        The file to load the `ignis.corpus.Corpus` object from.

    Returns
    -------
    ignis.corpus.Corpus
    """
    with bz2.open(filename, "rb") as fp:
        loaded = pickle.load(fp)

    if not type(loaded) is Corpus:
        raise ValueError(f"File does not contain a `Corpus` object: '{filename}'")

    # Re-initialise the `Corpus` with all `Documents` and the dynamic stop word list.
    new_corpus = Corpus()
    if hasattr(loaded, "_stop_words"):
        # Copy stop words, if they are set
        new_corpus._stop_words = set(loaded._stop_words)

    if hasattr(loaded, "documents"):
        # Old version - `.documents` was directly accessible.
        docs = loaded.documents.values()
    else:
        # New version - documents can only be retrieved via `get_document()`
        docs = loaded._documents.values()

    for doc in docs:
        # `Document` objects are un-pickled separately; `Document.__setstate__()`
        # ensures that the `raw_tokens` attribute is set appropriately.
        new_corpus.add_doc(
            doc.raw_tokens, doc.metadata, doc.display_str, doc.plain_text
        )

    return new_corpus
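The loader above follows a common pattern: unpickle from a bz2-compressed file, then refuse anything that is not exactly the expected type. The following sketch shows that pattern in isolation, using only the standard library. It uses collections.Counter as a stand-in for the corpus class, and the helper names save_compressed and load_counter are hypothetical, not part of ignis. As with any pickle-based format, such files should only be loaded from trusted sources.

```python
import bz2
import pickle
import tempfile
from collections import Counter
from pathlib import Path


def save_compressed(obj, filename):
    """Pickle `obj` into a bz2-compressed file."""
    with bz2.open(filename, "wb") as fp:
        pickle.dump(obj, fp)


def load_counter(filename):
    """Load a pickled `Counter`, rejecting files that hold anything else."""
    with bz2.open(filename, "rb") as fp:
        loaded = pickle.load(fp)
    # Exact type check, in the same spirit as the `type(loaded) is Corpus` test.
    if type(loaded) is not Counter:
        raise ValueError(f"File does not contain a `Counter` object: '{filename}'")
    return loaded


with tempfile.TemporaryDirectory() as tmp_dir:
    good_file = Path(tmp_dir) / "counts.pkl.bz2"
    bad_file = Path(tmp_dir) / "other.pkl.bz2"
    save_compressed(Counter(["topic", "model", "topic"]), good_file)
    save_compressed(["not", "a", "counter"], bad_file)

    restored = load_counter(good_file)
    try:
        load_counter(bad_file)
        rejected = False
    except ValueError:
        rejected = True

print(restored, rejected)
```

Checking the type immediately after loading means a mismatched file fails with a clear message at load time instead of causing an obscure attribute error later.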
def load_results(filename)
- Loads an Aurum results object from the given file.
Parameters
filename : str or pathlib.Path
- The file to load the Aurum object from.
Returns
ignis.aurum.Aurum
def load_results(filename):
    """
    Loads an `ignis.aurum.Aurum` results object from the given file.

    Parameters
    ----------
    filename: str or pathlib.Path
        The file to load the `ignis.aurum.Aurum` object from.

    Returns
    -------
    ignis.aurum.Aurum
    """
    with bz2.open(filename, "rb") as fp:
        save_object = pickle.load(fp)

    # Rehydrate the Ignis/external models
    model_type = save_object["model_type"]
    save_model = save_object["save_model"]
    external_model_bytes = save_object["external_model_bytes"]

    if model_type[:3] == "tp_":
        # Tomotopy model
        external_model = _load_tomotopy_model(model_type, external_model_bytes)
    else:
        raise ValueError(f"Unknown model type: '{model_type}'")

    save_model.model = external_model

    # Rehydrate the Aurum object
    aurum = Aurum(save_model)
    aurum.vis_type = save_object["vis_type"]
    aurum.vis_options = save_object["vis_options"]
    aurum.vis_data = save_object["vis_data"]

    return aurum
def load_slice(filename)
- Loads a CorpusSlice object from the given file.
Parameters
filename : str or pathlib.Path
- The file to load the CorpusSlice object from.
Returns
ignis.corpus.CorpusSlice
def load_slice(filename):
    """
    Loads a `ignis.corpus.CorpusSlice` object from the given file.

    Parameters
    ----------
    filename: str or pathlib.Path
        The file to load the `ignis.corpus.CorpusSlice` object from.

    Returns
    -------
    ignis.corpus.CorpusSlice
    """
    with bz2.open(filename, "rb") as fp:
        loaded = pickle.load(fp)

    if not type(loaded) is CorpusSlice:
        raise ValueError(f"File does not contain a `CorpusSlice` object: '{filename}'")

    return loaded
def suggest_num_topics(*args, verbose=True, **kwargs)
- Convenience function for running compare_topic_count_coherence() and directly reporting the topic count with the highest coherence found.
Parameters
verbose : bool, optional
- Whether or not to print the details of the best topic count.
*args, **kwargs
- Passed on to compare_topic_count_coherence().
Returns
int
- The suggested topic count.
def suggest_num_topics(*args, verbose=True, **kwargs):
    """
    Convenience function for running `compare_topic_count_coherence()` and directly
    reporting the topic count with the highest coherence found.

    Parameters
    ----------
    verbose: bool, optional
        Whether or not to print the details of the best topic count.
    *args, **kwargs
        Passed on to `compare_topic_count_coherence()`.

    Returns
    -------
    int
        The suggested topic count.
    """
    results = compare_topic_count_coherence(*args, verbose=verbose, **kwargs)
    results = sorted(results, key=lambda x: x[1], reverse=True)
    best = results[0]
    if verbose:
        print(f"Suggested topic count: {best[0]}\t" f"Coherence: {best[1]}")
    return best[0]
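The selection step in the listing above can be tried on its own. This minimal sketch assumes, based on how the results are indexed in the source, that each result is a (topic_count, coherence) pair. The helper name pick_best_topic_count and the scores are made up for illustration.

```python
def pick_best_topic_count(results):
    """Return the (topic_count, coherence) pair with the highest coherence."""
    if not results:
        raise ValueError("No candidate topic counts were provided.")
    # Sort by the coherence score (second element), highest first.
    ranked = sorted(results, key=lambda pair: pair[1], reverse=True)
    return ranked[0]


# Made-up coherence scores for three candidate topic counts.
candidate_scores = [(5, 0.41), (10, 0.47), (15, 0.44)]
best_count, best_coherence = pick_best_topic_count(candidate_scores)
print(f"Suggested topic count: {best_count}\tCoherence: {best_coherence}")
```

Because Python's sort is stable even with reverse=True, candidates with equal coherence keep their original order, so the first one listed is chosen in a tie.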
def train_model(corpus_slice, pre_model=None, model_type='tp_lda', model_options=None, labeller_type=None, labeller_options=None, vis_type='pyldavis', vis_options=None)
- Top-level helper for training topic models using the various algorithms available.
Parameters
corpus_slice : Corpus or CorpusSlice
- The CorpusSlice to perform the topic modelling over. If a Corpus is passed instead, a CorpusSlice containing all of its Document objects will be created.
pre_model : LDAModel, optional
- This is needed when you want to train a tomotopy LDA model with word priors. Default is None.
model_type : {"tp_lda", "tp_hdp", "tp_lda_wp"}
- Type of model to train; corresponds to the model type listed in the relevant ignis.models class.
model_options : dict, optional
- Dictionary of options that will be passed to the relevant ignis.models model constructor.
labeller_type : {"tomotopy"}, optional
- The type of automated labeller to use, if available.
labeller_options : dict, optional
- Dictionary of options that will be passed to the relevant ignis.labeller object constructor.
vis_type : {"pyldavis"}, optional
- The type of visualisation data to extract, if available.
vis_options : dict, optional
- Dictionary of options that will be passed to the relevant ignis.vis object constructor.
Returns
ignis.aurum.Aurum
- The Aurum results object for the trained model, which can be used for further exploration and iteration.
def train_model(
    corpus_slice,
    pre_model=None,
    model_type="tp_lda",
    model_options=None,
    labeller_type=None,
    labeller_options=None,
    vis_type="pyldavis",
    vis_options=None,
):
    """
    Top-level helper for training topic models using the various algorithms available.

    Parameters
    ----------
    corpus_slice: ignis.corpus.Corpus or ignis.corpus.CorpusSlice
        The `ignis.corpus.CorpusSlice` to perform the topic modelling over.
        If a `ignis.corpus.Corpus` is passed instead, a `ignis.corpus.CorpusSlice`
        containing all of its `ignis.corpus.Document` objects will be created.
    pre_model: ignis.models.lda.LDAModel, optional
        This is needed when you want to train a `tomotopy` LDA model with word priors.
        Default is `None`.
    model_type: {"tp_lda", "tp_hdp", "tp_lda_wp"}
        Type of model to train; corresponds to the model type listed in the relevant
        `ignis.models` class.
    model_options: dict, optional
        Dictionary of options that will be passed to the relevant `ignis.models`
        model constructor.
    labeller_type: {"tomotopy"}, optional
        The type of automated labeller to use, if available.
    labeller_options: dict, optional
        Dictionary of options that will be passed to the relevant `ignis.labeller`
        object constructor.
    vis_type: {"pyldavis"}, optional
        The type of visualisation data to extract, if available.
    vis_options: dict, optional
        Dictionary of options that will be passed to the relevant `ignis.vis` object
        constructor.

    Returns
    -------
    ignis.aurum.Aurum
        The `ignis.aurum.Aurum` results object for the trained model, which can be used
        for further exploration and iteration.
    """
    if type(corpus_slice) is ignis.corpus.Corpus:
        corpus_slice = corpus_slice.slice_full()

    if not type(corpus_slice) is ignis.corpus.CorpusSlice:
        raise ValueError(
            "Ignis models must be instantiated with Corpus or CorpusSlice instances."
        )

    if model_type == "tp_lda":
        model = ignis.models.LDAModel(corpus_slice, model_options)
        model.train()
        aurum = ignis.aurum.Aurum(model)
    elif model_type == "tp_lda_wp":
        # Tomotopy LDA model with word priors
        # (contrib.: C. Ow)
        if isinstance(pre_model, ignis.models.lda.LDAModel):
            model = pre_model
        else:
            raise ValueError(
                "Ignis models with word priors must be pre-instantiated "
                "`ignis.models.lda.LDAModel` instances."
            )
        model.train()
        aurum = ignis.aurum.Aurum(model)
    elif model_type == "tp_hdp":
        model = ignis.models.HDPModel(corpus_slice, model_options)
        model.train()
        aurum = ignis.aurum.Aurum(model)
    else:
        raise ValueError(f"Unknown model type: '{model_type}'")

    if labeller_type is not None:
        if labeller_options is None:
            labeller_options = {}
        aurum.init_labeller(labeller_type, **labeller_options)

    if vis_type is not None:
        if vis_options is None:
            vis_options = {}
        aurum.init_vis(vis_type, **vis_options)

    return aurum
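For orientation, here is one way the documented top-level functions might be combined. It assumes ignis-tm is installed and that a CorpusSlice was saved to disk earlier. The wrapper name train_from_saved_slice is hypothetical, and no particular model_options keys are assumed, since those depend on the relevant ignis.models class.

```python
def train_from_saved_slice(slice_path, model_options=None):
    """Hypothetical wrapper: load a saved CorpusSlice and train an LDA model."""
    # Imported inside the function so that defining this helper does not
    # itself require ignis-tm to be installed.
    import ignis

    corpus_slice = ignis.load_slice(slice_path)
    # "tp_lda" and "pyldavis" are the documented defaults, spelled out here
    # for clarity; the return value is an Aurum results object.
    return ignis.train_model(
        corpus_slice,
        model_type="tp_lda",
        model_options=model_options,
        vis_type="pyldavis",
    )
```

A call such as train_from_saved_slice("my-data.slice"), where the file name is made up, would return the Aurum object for the trained model, ready for further exploration and iteration.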