Ignis: Iterative Topic Modelling Platform

ignis is an extensible platform that provides a common interface for creating and visualising topic models.

By default, it supports creating LDA models using Tomotopy (https://bab2min.github.io/tomotopy/) and visualising them using pyLDAvis (https://github.com/bmabey/pyLDAvis), but support for other models and frameworks can be written in as necessary.

API documentation is available at https://zechyw.github.io/ignis-tm/ignis/.

Installation

The library is published on PyPI as ignis-tm. To use it in a project, first install that package:

pip install ignis-tm

After installation, import and use the library as ignis in your code:

import ignis

Demonstration/Development Environment Walkthrough

A full demonstration/development environment can be easily set up using Python 3.7 and pipenv.

Clone the repository

Start by cloning the repository and navigating to the root folder of the codebase:

git clone https://github.com/ZechyW/ignis-tm.git
cd ignis-tm

Install the project dependencies

Install pipenv and use it to install the other dependencies:

pip install pipenv
pipenv install --dev

The pipenv environment can then be activated from the codebase root:

pipenv shell

The pipenv environment will always need to be activated before the demo Jupyter notebooks can be used.

Perform post-installation steps

The full demonstration setup includes a number of Jupyter plugins under its dev dependencies that could be useful for working with the sample notebooks.

With the demo environment activated, install and configure the plugins:

jupyter contrib nbextension install --user
jupyter nbextensions_configurator enable --user

You can then configure the Jupyter notebook extensions directly from the web-based Jupyter UI. In particular, see https://neuralcoder.science/Black-Jupyter/ for a guide to setting up the Code Prettify extension using black. The ExecuteTime extension is also useful for tracking cell execution times.

You will also need to download the spaCy en_core_web_sm package if you intend to perform lemmatisation on your data:

python -m spacy download en_core_web_sm

Run the sample notebooks

Once the installation is complete, you can start a Jupyter Notebook instance (be sure to activate the pipenv environment first if it is not already active):

jupyter notebook

Then go through the self-documented Ignis Corpus and Ignis LDA notebooks to explore the BBC news dataset.

Other Notes

Random seeds and indeterminacy

N.B.: The behaviour described below should be fixed in Tomotopy >= 0.9.1, which uses a different random number generation scheme. Note that models created with Tomotopy < 0.9.1 might therefore differ from newer models even if the same seed is set.

Some dependencies that perform non-deterministic operations (e.g., Tomotopy, Gensim) may need PYTHONHASHSEED to be set before the Python interpreter starts in order to reproduce results consistently. To be safe, set it explicitly wherever reproducibility matters.

If using a Conda environment, this can be done with:

conda env config vars set PYTHONHASHSEED=<seed>

For direct invocation:

PYTHONHASHSEED=<seed> python script.py
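
As a rough illustration of what the variable controls, the following sketch (not part of ignis) launches fresh interpreters with PYTHONHASHSEED set and compares the built-in hash() of a string across runs. The helper name hash_in_child and the seed value are made up for this example, and it assumes Python 3.7+ for subprocess.run(capture_output=...).

```python
import os
import subprocess
import sys


def hash_in_child(text, seed):
    """Return hash(text) as computed by a fresh interpreter started with `seed`."""
    # Copy the current environment and add the seed for the child process only.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# Two separate interpreter runs with the same (arbitrary) seed.
first = hash_in_child("ignis", seed=11399)
second = hash_in_child("ignis", seed=11399)

print("Same seed gives the same string hash:", first == second)
```

The variable is read when the interpreter starts, so assigning to os.environ inside an already running script does not change string hashing for that process. This is why the seed has to be supplied through the environment, as in the commands above.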

For Jupyter notebooks in a non-Conda environment, edit the Jupyter kernel.json to add an appropriate env key.
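
One way to make that edit is sketched below. It assumes a standard kernel spec file, in which a top-level env object maps environment variable names to string values. The helper name set_kernel_env and the contents of the demo spec are illustrative only. The locations of installed kernel specs can be listed with `jupyter kernelspec list`.

```python
import json
import tempfile
from pathlib import Path


def set_kernel_env(kernel_json_path, name, value):
    """Add or update one environment variable in a Jupyter kernel.json file."""
    path = Path(kernel_json_path)
    spec = json.loads(path.read_text(encoding="utf-8"))
    # Create the `env` mapping if the spec does not have one yet.
    spec.setdefault("env", {})[name] = str(value)
    path.write_text(json.dumps(spec, indent=1) + "\n", encoding="utf-8")
    return spec


# Demonstration on a throwaway kernel spec rather than a real one.
demo_spec = {
    "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
    "display_name": "Python 3",
    "language": "python",
}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "kernel.json"
    demo_path.write_text(json.dumps(demo_spec), encoding="utf-8")
    updated = set_kernel_env(demo_path, "PYTHONHASHSEED", 11399)

print(updated["env"])
```

The kernel has to be restarted after the file changes, since the environment is only applied when the kernel process is launched.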

Miscellaneous notes on dependencies

The ipython and jedi packages are pinned to specific versions in the demo pipenv environment to ensure their compatibility with extensions and code completion within Jupyter notebooks; unfortunately, they break with later versions due to a lack of upstream updates.

Changes

1.6.5 (18 June 2021)
- Made Corpus objects iterable. Document objects are also now accessible by index.
- Fixed LDAModel to handle documents with empty token lists. These documents come about when all their tokens are removed by the root stop word list at run-time.

1.5.0 (1 June 2021)
- General functionality update to match development version; enhancements and improvements across the board.
- Updated demo walkthrough notebooks.

"""
.. include:: ../README.md
"""

import ignis.aurum
import ignis.corpus
import ignis.probat

from ignis.__version__ import __version__

Corpus = ignis.corpus.Corpus
load_corpus = ignis.corpus.load_corpus
load_slice = ignis.corpus.load_slice

train_model = ignis.probat.train_model
suggest_num_topics = ignis.probat.suggest_num_topics

load_results = ignis.aurum.load_results

# Documentation
__pdoc__ = {"tests": False}
__all__ = [
    "load_corpus",
    "load_slice",
    "train_model",
    "suggest_num_topics",
    "load_results",
]

Sub-modules

ignis.aurum: Aurum instances manage the results of topic modelling runs, and provide methods for exploring and iterating over them.
ignis.corpus: Corpus and CorpusSlice instances are containers for tracking the Document objects in a dataset.
ignis.labeller: Classes for working with automated topic labellers. End-users should not need to use these classes directly.
ignis.models: Classes for working with topic modelling algorithms. End-users should not need to use these classes directly.
ignis.probat: Functions for performing topic modelling to get Aurum results.
ignis.util: General utility classes that are not directly part of the topic modelling functionality provided by ignis.
ignis.vis: Classes for working with topic modelling result visualisations. End-users should not need to use these classes directly.

Functions

def load_corpus(filename)

Loads a Corpus object from the given file.

Parameters

filename : str or pathlib.Path: The file to load the Corpus object from.

Returns

Corpus

def load_corpus(filename):
    """
    Loads an `ignis.corpus.Corpus` object from the given file.

    Parameters
    ----------
    filename: str or pathlib.Path
        The file to load the `ignis.corpus.Corpus` object from.

    Returns
    -------
    ignis.corpus.Corpus
    """
    with bz2.open(filename, "rb") as fp:
        loaded = pickle.load(fp)

    if type(loaded) is not Corpus:
        raise ValueError(f"File does not contain a `Corpus` object: '{filename}'")

    # Re-initialise the `Corpus` with all `Documents` and the dynamic stop word list.
    new_corpus = Corpus()
    if hasattr(loaded, "_stop_words"):
        # Copy stop words, if they are set
        new_corpus._stop_words = set(loaded._stop_words)
    if hasattr(loaded, "documents"):
        # Old version - `.documents` was directly accessible.
        docs = loaded.documents.values()
    else:
        # New version - documents can only be retrieved via `get_document()`
        docs = loaded._documents.values()
    for doc in docs:
        # `Document` objects are un-pickled separately; `Document.__setstate__()`
        # ensures that the `raw_tokens` attribute is set appropriately.
        new_corpus.add_doc(
            doc.raw_tokens, doc.metadata, doc.display_str, doc.plain_text
        )

    return new_corpus

def load_results(filename)

Loads an Aurum results object from the given file.

Parameters

filename : str or pathlib.Path: The file to load the Aurum object from.

Returns

Aurum

def load_results(filename):
    """
    Loads an `ignis.aurum.Aurum` results object from the given file.

    Parameters
    ----------
    filename: str or pathlib.Path
        The file to load the `ignis.aurum.Aurum` object from.

    Returns
    -------
    ignis.aurum.Aurum
    """
    with bz2.open(filename, "rb") as fp:
        save_object = pickle.load(fp)

    # Rehydrate the Ignis/external models
    model_type = save_object["model_type"]
    save_model = save_object["save_model"]
    external_model_bytes = save_object["external_model_bytes"]

    if model_type.startswith("tp_"):
        # Tomotopy model
        external_model = _load_tomotopy_model(model_type, external_model_bytes)
    else:
        raise ValueError(f"Unknown model type: '{model_type}'")

    save_model.model = external_model

    # Rehydrate the Aurum object
    aurum = Aurum(save_model)

    aurum.vis_type = save_object["vis_type"]
    aurum.vis_options = save_object["vis_options"]
    aurum.vis_data = save_object["vis_data"]

    return aurum

def load_slice(filename)

Loads a CorpusSlice object from the given file.

Parameters

filename : str or pathlib.Path: The file to load the CorpusSlice object from.

Returns

CorpusSlice

def load_slice(filename):
    """
    Loads an `ignis.corpus.CorpusSlice` object from the given file.

    Parameters
    ----------
    filename: str or pathlib.Path
        The file to load the `ignis.corpus.CorpusSlice` object from.

    Returns
    -------
    ignis.corpus.CorpusSlice
    """
    with bz2.open(filename, "rb") as fp:
        loaded = pickle.load(fp)

    if type(loaded) is not CorpusSlice:
        raise ValueError(f"File does not contain a `CorpusSlice` object: '{filename}'")

    return loaded
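
The loaders above share one pattern: open a bz2-compressed file, unpickle its contents, and reject anything that is not exactly the expected type. The sketch below reproduces that pattern with standard-library types only, so it runs without ignis installed. The names save_compressed and load_typed are hypothetical; in particular, save_compressed is an assumed counterpart for the demonstration, not the library's own save routine.

```python
import bz2
import pickle
import tempfile
from collections import OrderedDict
from pathlib import Path


def save_compressed(obj, filename):
    """Pickle `obj` into a bz2-compressed file (assumed counterpart to the loader)."""
    with bz2.open(filename, "wb") as fp:
        pickle.dump(obj, fp)


def load_typed(filename, expected_type):
    """Unpickle a bz2-compressed file and insist on an exact type match."""
    with bz2.open(filename, "rb") as fp:
        loaded = pickle.load(fp)
    # `type(...) is not ...` rejects subclasses too, unlike `isinstance()`.
    if type(loaded) is not expected_type:
        raise ValueError(
            f"File does not contain a `{expected_type.__name__}` object: '{filename}'"
        )
    return loaded


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.bz2"

    # Round trip with the expected type.
    save_compressed({"doc-1": ["fire", "gold"]}, path)
    restored = load_typed(path, dict)

    # An OrderedDict is a dict subclass, so the exact-type check rejects it.
    save_compressed(OrderedDict(a=1), path)
    try:
        load_typed(path, dict)
        rejected = False
    except ValueError:
        rejected = True

print(restored, rejected)
```

Because unpickling can execute arbitrary code, files should only be loaded this way when they come from a trusted source. The type check runs after unpickling and does not protect against a malicious file.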

def suggest_num_topics(*args, verbose=True, **kwargs)

Convenience function for running compare_topic_count_coherence() and directly reporting the topic count with the highest coherence found.

Parameters

verbose : bool, optional: Whether or not to print the details of the best topic count.
*args, **kwargs: Passed on to compare_topic_count_coherence().

Returns

int: The suggested topic count.

def suggest_num_topics(*args, verbose=True, **kwargs):
    """
    Convenience function for running `compare_topic_count_coherence()` and directly
    reporting the topic count with the highest coherence found.

    Parameters
    ----------
    verbose: bool, optional
        Whether or not to print the details of the best topic count.
    *args, **kwargs
        Passed on to `compare_topic_count_coherence()`.

    Returns
    -------
    int
        The suggested topic count.
    """
    results = compare_topic_count_coherence(*args, verbose=verbose, **kwargs)
    results = sorted(results, key=lambda x: x[1], reverse=True)
    best = results[0]

    if verbose:
        print(f"Suggested topic count: {best[0]}\tCoherence: {best[1]}")

    return best[0]
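
The selection step can be tried in isolation. The sketch below assumes, as the indexing in the source above suggests, that compare_topic_count_coherence() yields (topic count, coherence) pairs. The function name pick_best_topic_count and the scores are made up for illustration and do not come from a real training run.

```python
def pick_best_topic_count(results):
    """Return the (topic_count, coherence) pair with the highest coherence."""
    if not results:
        raise ValueError("No coherence results to choose from.")
    # max() keeps the first pair it sees among equal coherence scores.
    return max(results, key=lambda pair: pair[1])


# Invented example scores: (topic count, coherence).
scores = [(5, 0.412), (10, 0.478), (15, 0.451)]
best = pick_best_topic_count(scores)

print(f"Suggested topic count: {best[0]}\tCoherence: {best[1]}")
```

Using max() avoids sorting the full list and resolves ties in the same way as a stable descending sort followed by taking the first element.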

def train_model(corpus_slice, pre_model=None, model_type='tp_lda', model_options=None, labeller_type=None, labeller_options=None, vis_type='pyldavis', vis_options=None)

Top-level helper for training topic models using the various algorithms available.

Parameters

corpus_slice : Corpus or CorpusSlice: The CorpusSlice to perform the topic modelling over. If a Corpus is passed instead, a CorpusSlice containing all of its Document objects will be created.
pre_model : LDAModel, optional: A pre-instantiated model; required when model_type is "tp_lda_wp" (a tomotopy LDA model with word priors) and unused otherwise. Default is None.
model_type : {"tp_lda", "tp_hdp", "tp_lda_wp"}: Type of model to train; corresponds to the model type listed in the relevant ignis.models class.
model_options : dict, optional: Dictionary of options that will be passed to the relevant ignis.models model constructor.
labeller_type : {"tomotopy"}, optional: The type of automated labeller to use, if available.
labeller_options : dict, optional: Dictionary of options that will be passed to the relevant ignis.labeller object constructor.
vis_type : {"pyldavis"}, optional: The type of visualisation data to extract, if available.
vis_options : dict, optional: Dictionary of options that will be passed to the relevant ignis.vis object constructor.

Returns

Aurum: The Aurum results object for the trained model, which can be used for further exploration and iteration.

def train_model(
    corpus_slice,
    pre_model=None,
    model_type="tp_lda",
    model_options=None,
    labeller_type=None,
    labeller_options=None,
    vis_type="pyldavis",
    vis_options=None,
):
    """
    Top-level helper for training topic models using the various algorithms available.

    Parameters
    ----------
    corpus_slice: ignis.corpus.Corpus or ignis.corpus.CorpusSlice
        The `ignis.corpus.CorpusSlice` to perform the topic modelling over. If an
        `ignis.corpus.Corpus` is passed instead, an `ignis.corpus.CorpusSlice`
        containing all of its `ignis.corpus.Document` objects will be created.
    pre_model: ignis.models.lda.LDAModel, optional
        A pre-instantiated model; required when `model_type` is `"tp_lda_wp"` (a
        `tomotopy` LDA model with word priors) and unused otherwise. Default is `None`.
    model_type: {"tp_lda", "tp_hdp", "tp_lda_wp"}
        Type of model to train; corresponds to the model type listed in the relevant
        `ignis.models` class.
    model_options: dict, optional
        Dictionary of options that will be passed to the relevant `ignis.models`
        model constructor.
    labeller_type: {"tomotopy"}, optional
        The type of automated labeller to use, if available.
    labeller_options: dict, optional
        Dictionary of options that will be passed to the relevant `ignis.labeller`
        object constructor.
    vis_type: {"pyldavis"}, optional
        The type of visualisation data to extract, if available.
    vis_options: dict, optional
        Dictionary of options that will be passed to the relevant `ignis.vis`
        object constructor.

    Returns
    -------
    ignis.aurum.Aurum
        The `ignis.aurum.Aurum` results object for the trained model, which can be used
        for further exploration and iteration.
    """
    if type(corpus_slice) is ignis.corpus.Corpus:
        corpus_slice = corpus_slice.slice_full()
    if type(corpus_slice) is not ignis.corpus.CorpusSlice:
        raise ValueError(
            "Ignis models must be instantiated with Corpus or CorpusSlice instances."
        )

    if model_type == "tp_lda":
        model = ignis.models.LDAModel(corpus_slice, model_options)
        model.train()
        aurum = ignis.aurum.Aurum(model)
    elif model_type == "tp_lda_wp":
        # Tomotopy LDA model with word priors
        # (contrib.: C. Ow)
        if isinstance(pre_model, ignis.models.lda.LDAModel):
            model = pre_model
        else:
            raise ValueError(
                "Ignis models with word priors must be pre-instantiated "
                "`ignis.models.lda.LDAModel` instances."
            )
        model.train()
        aurum = ignis.aurum.Aurum(model)
    elif model_type == "tp_hdp":
        model = ignis.models.HDPModel(corpus_slice, model_options)
        model.train()
        aurum = ignis.aurum.Aurum(model)
    else:
        raise ValueError(f"Unknown model type: '{model_type}'")

    if labeller_type is not None:
        if labeller_options is None:
            labeller_options = {}

        aurum.init_labeller(labeller_type, **labeller_options)

    if vis_type is not None:
        if vis_options is None:
            vis_options = {}

        aurum.init_vis(vis_type, **vis_options)

    return aurum