Tutorial ======== Installation ------------ **tl;dr:** ``python3 -mpip install coniferest`` You usually want to create virtual environment first, for example with ``venv``: .. code-block:: bash python3 -mvenv venv source venv/bin/activate We need up-to-date pip: .. code-block:: bash python3 -mpip install -U pip Finally, install ``coniferest``: .. code-block:: bash python3 -mpip install coniferest For any problems, please `file an issue on the GitHub `_. Example: non-active anomaly detection ------------------------------------- Let's generate a simple 2-D dataset with a single outlier as a last object, and run :class:`IsolationForest ` model (see `Liu et al. 2008 `_) on it: .. code-block:: python from coniferest.datasets import single_outlier from coniferest.isoforest import IsolationForest data, _metadata = single_outlier(10_000) model = IsolationForest(random_seed=0) model.fit(data) scores = model.score_samples(data) print("Index of the outlier:", scores.argmin()) Example: ZTF light curves of M31 field -------------------------------------- Let's use built-in dataset of ZTF light curve features adopted from `Malanchev at al. (2021) `_: .. code-block:: python from coniferest.datasets import ztf_m31 data, metadata = ztf_m31() print(data.shape) Here ``data`` is 2-D feature dataset (first axis is for objects, second is for features) and ``metadata`` is 1-D array of ZTF DR object IDs. Next we need a active anomaly detection model to find outliers in this dataset. We will use :class:`PineForest ` model developed by the `SNAD `_ team. It is based on Isolation forest model and uses a tree filtering algorithm to learn from the decisions made by the user: .. code-block:: python from coniferest.pineforest import PineForest model = PineForest( # Number of trees to use for predictions n_trees=256, # Number of new tree to grow for each decision n_spare_trees=768, # Fix random seed for reproducibility random_seed=0, ) Now we are ready to run active anomaly detection :class:`Session `: .. code-block:: python from coniferest.session import Session from coniferest.session.callback import ( TerminateAfter, viewer_decision_callback, ) session = Session( data=data, metadata=metadata, model=model, # Prompt for a decision and open object's page on the SNAD Viewer decision_callback=viewer_decision_callback, on_decision_callbacks=[ # Terminate session after 10 decisions TerminateAfter(10), ], ) session.run() This will prompt you to make a decision for an object with the highest outlier score and show you this object in the browser. Each decision you make retrains the model and updates the outlier scores. After 10 decisions the session will be terminated, but you can also stop it by pressing ``Ctrl+C``. Please note that :class:`Session ` mutates the model, so you should re-create the model if you want to start a new session with the blank model. For the first candidate you should get a recurrent variable `ZTF DR 695211200075348 `_ / `M31N 2013-11b `_ / `MASTER OTJ004126.22+414350.0 `_. SNAD team reported this object as an anomaly in `Malanchev at al. (2021) `_, it is believed to be a recurrent Nova or `a long-period variable star `_. If you are interested in this type of object, say Y otherwise mark it as not-interesting by replying N and continue your investigations — the objects shown to you will depend on the sequence of your replies. After the session is finished you can explore :class:`Session ` objects for the decisions you made and final state of the model: .. code-block:: python from pprint import pprint print('Decisions:') pprint({metadata[idx]: label.name for idx, label in session.known_labels.items()}) print('Final scores:') pprint({metadata[idx]: session.scores[idx] for idx in session.known_labels}) ``coniferest`` also provides "Active Anomaly Detection" model based on Isolation forest :class:`AADForest ` model (see `Das et al., 2017 `_ and `Ishida et al., 2021 `_ for details). You can use it instead of :class:`PineForest ` model in the example above: .. code-block:: python from coniferest.aadforest import AADForest model = AADForest( # Use 1024 trees, a trade-off between speed and accuracy n_trees=1024, # Fix random seed for reproducibility random_seed=0, ) Use your own data ----------------- In examples above we used built-in datasets, but you can easily use your own data. The only requirement is that your data should be a 2-D array of features, where first axis is for objects and second is for features. You also need an 1-D array of object metadata, which can be anything you want, but typically you would make it to be object IDs or names. Let's generate a simple 2-D dataset and run :class:`PineForest ` model on it using a simple interactive prompt as a decision callback: .. code-block:: python import numpy as np from coniferest.pineforest import PineForest from coniferest.session import Session from coniferest.session.callback import ( TerminateAfter, prompt_decision_callback, ) n_objects = 1000 n_features = 10 rng = np.random.default_rng(0) data = rng.normal(size=(n_objects, n_features)) # prompt_decision_callback will print object's metadata, so it should # have a pretty representation. Integer index is good for this example metadata = np.arange(n_objects) session = Session( data=data, metadata=metadata, model=PineForest(random_seed=0), decision_callback=prompt_decision_callback, on_decision_callbacks=TerminateAfter(10), ) session.run() Session life cycle ------------------ Examples above use :class:`Session ` class to run active anomaly detection. Let's take a closer look at how it works, please refer to :class:`Session ` API documentation for more details on the class constructor arguments and the class methods and attributes. 1. Initialize session with :class:`Session() ` constructor 2. Call :meth:`Session.run() ` method to start the session. It will do the following: a) Session initializes the model (calls :meth:`.fit() ` on it) with the `data` b) The decision loop starts and last until session is terminated or all data points are labeled: i) :attr:`model ` is fit for current :attr:`Session.known_labels ` (calls :meth:`.fit_known() ` on it) ii) ``on_refit_callbacks`` are called iii) If all data points are labeled, the session is terminated iv) ``decision_callback`` is called, the decision is stored in :attr:`Session.known_labels ` v) ``on_decision_callbacks`` are called vi) If session was terminated in one of the callbacks, then the decision loop is terminated. Otherwise, go to step i) 3. Inspect :class:`Session() ` object attributes like :attr:`known_labels `, :attr:`scores ` and :attr:`model ` to get the results of the session Note that :class:`Session ` mutates input model, so if you use the same model object again for a new session, it will continue from the previous model state. Tune the Session with callbacks ------------------------------- Now we know how :class:`Session ` works, but how can we change its behavior? The answer is callbacks. Callbacks are functions that are called at different stages of the session lifecycle. Let's implement a simple callback per each stage: .. code-block:: python from coniferest.pineforest import PineForest from coniferest.label import Label from coniferest.session import Session def my_on_refit_callback(session): print('Refitting model with known labels:') print(session.known_labels) def my_decision_callback(metadata, data, session): """Say YES when the first feature is positive""" print(f'Labeling object {metadata}') return Label.ANOMALY if data[0] > 0.0 else Label.REGULAR def my_on_decision_callback(metadata, data, session): print(f'Decision made for {metadata}: {session.last_decision}.') def terminate_after_5_anomalies(metadata, data, session): if session.known_anomalies.size >= 5: session.terminate() class RecordCallback: def __init__(self): self.records = [] def __call__(self, metadata, data, session): self.records.append(f'{metadata} -> {session.last_decision}') def print_report(self): print('Records:') print('\n'.join(self.records)) record_callback = RecordCallback() rng = np.random.default_rng(0) data = rng.normal(size=(1000, 2)) metadata = np.arange(data.shape[0]) model = PineForest(random_seed=0) session = Session( data=data, metadata=metadata, model=model, decision_callback=my_decision_callback, # We can give an only function/callable as a callback on_refit_callbacks=my_on_refit_callback, # Or a list of callables on_decision_callbacks=[ my_on_decision_callback, record_callback, terminate_after_5_anomalies, ], ) session.run() print() record_callback.print_report() Use prior knowledge with known labels ------------------------------------- The final component of the :class:> constructor we haven't discussed yet is ``known_labels``. It allows you to provide prior knowledge about few samples in the dataset. This is useful when you know that some samples are anomalies or regular objects, so you can find more objects you like or get less false positives. Let's use a simulated dataset with 1024 regular objects and three "outlier" classes, each with 32 objects. Within these three classes, only one considered as anomalous, while other two are some kind of "bogus" objects that we don't want to find. .. code-block:: python # Install matplotlib if you don't have it import matplotlib.pyplot as plt from coniferest.datasets import non_anomalous_outliers from coniferest.label import Label data, metadata = non_anomalous_outliers( # Number of regular objects inliers=1024, # Number of objects per "outlier" class outliers=32, # Classification of "outlier" classes regions=[Label.R, Label.R, Label.A], ) # Plot the data index = metadata == Label.R plt.scatter(*data[index, :].T, marker='.', color='#22114C', label='regular') plt.scatter(*data[~index, :].T, marker='*', color='#22114C', label='anomaly') plt.legend() plt.show() .. image:: _static/tutorial/known_labels_dataset.png Let's see what :class:`IsolationForest ` will find in this dataset within top-32 outliers: .. code-block:: python from coniferest.isoforest import IsolationForest scores = IsolationForest(random_seed=0).fit(data).score_samples(data) top32 = scores.argsort()[:32] color = np.full_like(metadata, '#22114C', dtype=object) color[top32] = '#FCBD43' plt.cla() plt.scatter(*data[index, :].T, marker='.', color=color[index], label='regular') plt.scatter(*data[~index, :].T, marker='*', color=color[~index], label='anomaly') plt.legend() plt.show() .. image:: _static/tutorial/known_labels_isoforest.png Here we show 32 candidates in "SNAD yellow" color. Not bad, but what can we do with the active anomaly detection? .. code-block:: python from coniferest.pineforest import PineForest from coniferest.session import Session from coniferest.session.callback import TerminateAfter session = Session( data, metadata, model=PineForest(random_seed=0), # metadata consists of true labels, so we can use it as a decision decision_callback=lambda metadata, data, session: metadata, on_decision_callbacks=TerminateAfter(32), ) session.run() color = np.full_like(metadata, '#22114C', dtype=object) color[list(session.known_labels)] = '#FCBD43' plt.cla() plt.scatter(*data[index, :].T, marker='.', color=color[index], label='regular') plt.scatter(*data[~index, :].T, marker='*', color=color[~index], label='anomaly') plt.show() .. image:: _static/tutorial/known_labels_pineforest.png Looks good, right? But what if we know that one of the "outlier" objects is not an anomaly? Let's suppose that before doing the active anomaly detection we investigated the most-left object and found that it is a bogus object. We can mark it as a regular object and run the active anomaly detection again: .. code-block:: python from coniferest.pineforest import PineForest from coniferest.session import Session from coniferest.session.callback import TerminateAfter # Key is index of the object, value is its label. known_labels = {data[:, 0].argmin(): Label.REGULAR} session = Session( data, metadata, known_labels=known_labels, model=PineForest(random_seed=0), decision_callback=lambda metadata, data, session: metadata, on_decision_callbacks=TerminateAfter(32), ) session.run() color = np.full_like(metadata, '#22114C', dtype=object) new_labels = set(session.known_labels) - set(known_labels) color[list(new_labels)] = '#FCBD43' plt.cla() plt.scatter(*data[index, :].T, marker='.', color=color[index], label='regular') plt.scatter(*data[~index, :].T, marker='*', color=color[~index], label='anomaly') plt.show() .. image:: _static/tutorial/known_labels_pineforest_known_regular.png Note that the object we marked as regular was not even selected by the previous run, but here it influenced the model to not select other objects of this class.