coniferest package

Submodules

coniferest.aadforest module

class coniferest.aadforest.AADForest(n_trees=100, n_subsamples=256, max_depth=None, budget='auto', C_a=1.0, prior_influence=1.0, n_jobs=-1, random_seed=None, sampletrees_per_batch=1048576, map_value=None)[source]

Bases: Coniferest

Active Anomaly Detection with Isolation Forest.

See Das et al., 2017 https://arxiv.org/abs/1708.09441

The method solves the optimization problem:

\[\mathbf{w} = \arg\min_{\mathbf{w}} \left( \frac{C_a}{\left|\mathcal{A}\right|} \sum_{i \in \mathcal{A}} \mathrm{ReLU}\left(s(\mathbf{x_i} | \mathbf{w}) - q_{\tau}\right) + \frac{1}{\left|\mathcal{N}\right|} \sum_{i \in \mathcal{N}} \mathrm{ReLU}\left(q_{\tau} - s(\mathbf{x_i} | \mathbf{w})\right) + \frac{\alpha}{2} \lVert \mathbf{w} - \mathbf{w_0}\rVert^2\right),\]

where \(C_a\) is C_a, the regularization parameter \(\alpha\) is prior_influence, \(\mathcal{A}\) is the set of known anomalies, \(\mathcal{N}\) is the set of known nominals, \(s(\mathbf{x_i} | \mathbf{w})\) is the anomaly score of the instance with features \(\mathbf{x_i}\) given weights \(\mathbf{w}\), and \(q_{\tau}\) is the \(\tau\)-th quantile of the anomaly scores.

This problem is reformulated as an equivalent quadratic programming problem:

\[\begin{split}\begin{bmatrix} \mathbf{w}\\ \mathbf{u} \end{bmatrix} = \arg\min_{\mathbf{w}, \mathbf{u}} \left( \frac{C_a}{\left|\mathcal{A}\right|} \sum_{i \in \mathcal{A}} u_i + \frac{1}{\left|\mathcal{N}\right|} \sum_{i \in \mathcal{N}} u_i + \frac{\alpha}{2} \lVert \mathbf{w} - \mathbf{w_0} \rVert^2\right),\end{split}\]

with the following convex constraints:

\[\begin{split}u_i &\ge 0 \quad & i \in \mathcal{A} \cup \mathcal{N},\\ u_i - s(\mathbf{x_i} | \mathbf{w}) &\ge - q_{\tau}\quad & i \in \mathcal{A},\\ u_i + s(\mathbf{x_i} | \mathbf{w}) &\ge q_{\tau}\quad & i \in \mathcal{N}.\\\end{split}\]
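
The equivalence is the standard epigraph reformulation of the hinge loss: at the optimum each slack variable collapses onto the corresponding \(\mathrm{ReLU}\) term,

\[u_i^{\star} = \mathrm{ReLU}\left(s(\mathbf{x_i} | \mathbf{w}) - q_{\tau}\right) = \min \left\{ u_i : u_i \ge 0,\; u_i \ge s(\mathbf{x_i} | \mathbf{w}) - q_{\tau} \right\}, \quad i \in \mathcal{A},\]

and symmetrically for \(i \in \mathcal{N}\).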
Parameters:
  • n_trees (int, optional) – Number of trees in the isolation forest.

  • n_subsamples (int, optional) – How many subsamples should be used to build every tree.

  • max_depth (int or None, optional) – Maximum depth of every tree. If None, log2(n_subsamples) is used.

  • budget (int or float or "auto", optional) – Anomaly budget. A floating-point value is interpreted as a fraction of the full data, an integer value as a number of items. If "auto", the exact value is determined during training. Default is "auto".

  • n_jobs (int, default=-1) – Number of threads to use for scoring. If -1, use all available CPUs.

  • random_seed (int or None, optional) – Random seed to use for reproducibility. If None, a random seed is used.

  • prior_influence (float or callable, optional) – Regularization coefficient \(\alpha\) in the loss function. Default is 1.0. If callable, the signature is (anomaly_count, nominal_count) -> float.

  • map_value (["const", "exponential", "linear", "reciprocal"] or callable, optional) – A function applied to the leaf depth before weighting. The named variants correspond to 1, 1 - exp(-x), x and -1/x, respectively.
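
A minimal usage sketch (the toy data, the label assignments, and the assumption that lower scores mean more anomalous, following the sklearn convention, are illustrative rather than part of the API):

    import numpy as np

    from coniferest.aadforest import AADForest
    from coniferest.label import Label

    # Hypothetical toy data: 1000 objects with two features each.
    rng = np.random.default_rng(0)
    data = rng.normal(size=(1000, 2))

    # Expert feedback for three objects; everything else stays unknown.
    labels = np.full(data.shape[0], Label.UNKNOWN)
    labels[0] = Label.ANOMALY
    labels[1] = Label.REGULAR
    labels[2] = Label.REGULAR

    forest = AADForest(n_trees=100, C_a=1.0, random_seed=42)
    forest.fit(data, labels)

    scores = forest.score_samples(data)
    # Assuming the sklearn sign convention: lower score, more anomalous.
    candidates = np.argsort(scores)[:10]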

apply(x)[source]

Apply the forest to X, return leaf indices.

Parameters:

x (ndarray shape (n_samples, n_features)) – 2-d array with features.

Returns:

x_leafs – For each datapoint x in X and for each tree in the forest, return the index of the leaf x ends up in.

Return type:

ndarray of shape (n_samples, n_estimators)

feature_importance(x)[source]
feature_signature(x)[source]
fit(data, labels=None)[source]

Build the trees with the given data.

Parameters:
  • data – Array with feature values of objects.

  • labels – Optional. Labels of objects. May be regular, anomalous or unknown. See Label data for details.

Return type:

self

fit_known(data, known_data=None, known_labels=None)[source]

The same as fit, but with a slightly different API. Known data and labels are kept separate from the training data for time and space efficiency: known_data is usually much smaller than data, in which case it is not reasonable to hold labels for the whole dataset.

Parameters:
  • data – Training data (array with feature values) to build trees with.

  • known_data – Feature values of known data.

  • known_labels – Labels of known data.

Return type:

self
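
A sketch of this separated API (the data sizes and labels are hypothetical):

    import numpy as np

    from coniferest.aadforest import AADForest
    from coniferest.label import Label

    rng = np.random.default_rng(0)
    data = rng.normal(size=(100_000, 2))  # large unlabeled training set

    # Only three objects have been inspected, so only they carry labels.
    known_data = data[[0, 1, 2]]
    known_labels = np.array([Label.ANOMALY, Label.REGULAR, Label.REGULAR])

    forest = AADForest(random_seed=42).fit_known(data, known_data, known_labels)
    scores = forest.score_samples(data)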

score_samples(samples)[source]

Compute scores for the supplied data.

Parameters:

samples – Feature values to compute scores on.

Return type:

Array with computed scores.

coniferest.calc_paths_sum module

coniferest.coniferest module

class coniferest.coniferest.Coniferest(trees=None, n_subsamples=256, max_depth=None, n_jobs=-1, random_seed=None, sampletrees_per_batch=1048576)[source]

Bases: ABC

Base class for the forests in the package. It provides the basic low-level machinery on top of the sklearn trees used here.

Parameters:
  • trees (list or None, optional) – List with the trees in the forest. If None, an empty list is used.

  • n_subsamples (int, optional) – Number of subsamples to use for training.

  • max_depth (int or None, optional) – Maximum depth of the trees in use. If None, then log2(n_subsamples) is used.

  • n_jobs (int, default=-1) – Number of threads to use for scoring. If -1, use all available CPUs.

  • random_seed (int or None, optional) – Seed for reproducibility. If None, a random seed is used.

build_one_tree(data)[source]

Build just one tree.

Parameters:

data – Features to build the tree from.

Return type:

A tree.

build_trees(data, n_trees)[source]

Build n_trees trees from the supplied data.

Parameters:
  • data – Features.

  • n_trees – Number of trees to build.

Return type:

List of trees.

abstractmethod feature_importance(x)[source]
abstractmethod feature_signature(x)[source]
abstractmethod fit(data, labels=None)[source]

Fit to the applied data.

abstractmethod fit_known(data, known_data=None, known_labels=None)[source]

Fit to the applied data with priors.

abstractmethod score_samples(samples)[source]

Evaluate scores for samples.
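
Since Coniferest is abstract, a subclass must provide these five methods. A hypothetical minimal subclass (assuming, beyond what is documented here, that fitted trees are stored in the trees attribute and that a ConiferestEvaluator can be rebuilt after each fit):

    from coniferest.coniferest import Coniferest, ConiferestEvaluator


    class PlainForest(Coniferest):
        """Hypothetical minimal subclass: a plain isolation forest."""

        def __init__(self, n_trees=100, **kwargs):
            super().__init__(**kwargs)
            self.n_trees = n_trees
            self.evaluator = None

        def fit(self, data, labels=None):
            # build_trees() is provided by the base class.
            self.trees = self.build_trees(data, self.n_trees)
            self.evaluator = ConiferestEvaluator(self)
            return self

        def fit_known(self, data, known_data=None, known_labels=None):
            # This toy subclass ignores the priors.
            return self.fit(data)

        def score_samples(self, samples):
            return self.evaluator.score_samples(samples)

        def feature_importance(self, x):
            return self.evaluator.feature_importance(x)

        def feature_signature(self, x):
            return self.evaluator.feature_signature(x)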

class coniferest.coniferest.ConiferestEvaluator(coniferest, map_value=None)[source]

Bases: ForestEvaluator

Fast evaluator of scores for Coniferests.

Parameters:
  • coniferest (Coniferest) – The forest to build the evaluator from.

  • map_value (callable or None) – Optional function to map leaf values; it must accept a 1-D array of values and return an array of the same shape.

classmethod extract_selectors(tree, map_value=None)[source]

Extract node representations for the tree.

Parameters:
  • tree – Tree to extract selectors from.

  • map_value – Optional function to map leaf values.

Return type:

Array with selectors.

coniferest.evaluator module

class coniferest.evaluator.ForestEvaluator(samples, selectors, node_offsets, leaf_offsets, *, num_threads, sampletrees_per_batch)[source]

Bases: object

apply(x)[source]
classmethod average_path_length(n_nodes)[source]

Average path length is abstracted because in different cases we may want to use slightly different formulas to match other software exactly.

By default we use our own implementation.
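
For reference, the formula commonly used across isolation-forest implementations (Liu et al., 2008) is

\[c(n) = 2 H(n - 1) - \frac{2 (n - 1)}{n}, \qquad H(k) \approx \ln k + \gamma,\]

where \(\gamma \approx 0.5772\) is the Euler–Mascheroni constant. Whether this class reproduces it exactly is implementation-defined, per the note above.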

property batch_size
classmethod combine_selectors(selectors_list)[source]

Combine several node arrays into one array of nodes and arrays of per-tree start offsets.

Parameters:

selectors_list – List of node arrays to combine.

Returns:

  • np.ndarray of selectors – Node array with all the nodes from all the trees.

  • np.ndarray of int – Array of tree offsets for node-arrays.

  • np.ndarray of int – Array of tree offsets for leaf-arrays.

feature_importance(x)[source]
feature_signature(x)[source]
property n_leaves
property n_trees
score_samples(x)[source]

Perform the computations.

Parameters:

x – Features to calculate scores of. Should be C-contiguous for performance.

Return type:

Array of scores.

selector_dtype = dtype([('feature', '<i4'), ('left', '<i4'), ('value', '<f8'), ('right', '<i4'), ('node_average_path_length', '<f4')], align=True)
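
The field semantics below are read off the field names rather than documented; a sketch of how such a structured array behaves:

    import numpy as np

    # Same layout as ForestEvaluator.selector_dtype; comments are my reading.
    selector_dtype = np.dtype(
        [
            ("feature", "<i4"),  # index of the splitting feature
            ("left", "<i4"),     # index of the left child node
            ("value", "<f8"),    # split threshold (or leaf value)
            ("right", "<i4"),    # index of the right child node
            ("node_average_path_length", "<f4"),
        ],
        align=True,
    )

    nodes = np.zeros(4, dtype=selector_dtype)  # one record per tree node
    print(nodes["feature"], nodes["value"])    # structured-array field access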

coniferest.experiment module

coniferest.isoforest module

class coniferest.isoforest.IsolationForest(n_trees=100, n_subsamples=256, max_depth=None, n_jobs=-1, random_seed=None, sampletrees_per_batch=1048576)[source]

Bases: Coniferest

Isolation forest.

This is a reimplementation of sklearn.ensemble.IsolationForest that trains and evaluates much faster. It also supports multi-threaded evaluation (sample scoring).

Parameters:
  • n_trees (int, optional) – Number of trees in forest to build.

  • n_subsamples (int, optional) – Number of subsamples to use for building the trees.

  • max_depth (int or None, optional) – Maximal tree depth. If None, log2(n_subsamples) is used.

  • n_jobs (int, default=-1) – Number of threads to use for evaluation. If -1, use all available CPUs.

  • random_seed (int or None, optional) – Seed for reproducibility. If None, a random seed is used.
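
A minimal usage sketch (the toy data and the assumption that lower scores mean more anomalous, following the sklearn convention, are illustrative):

    import numpy as np

    from coniferest.isoforest import IsolationForest

    rng = np.random.default_rng(0)
    data = rng.normal(size=(1000, 4))
    data[:10] += 5.0  # hypothetical outliers shifted away from the bulk

    forest = IsolationForest(n_trees=100, random_seed=42).fit(data)
    scores = forest.score_samples(data)

    # Assuming sklearn's convention, the lowest scores flag the outliers.
    most_anomalous = np.argsort(scores)[:10]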

apply(x)[source]

Apply the forest to X, return leaf indices.

Parameters:

x (ndarray shape (n_samples, n_features)) – 2-d array with features.

Returns:

x_leafs – For each datapoint x in X and for each tree in the forest, return the index of the leaf x ends up in.

Return type:

ndarray of shape (n_samples, n_estimators)

feature_importance(x)[source]
feature_signature(x)[source]
fit(data, labels=None)[source]

Build the trees based on data.

Parameters:
  • data – 2-d array with features.

  • labels – Unused. Defaults to None.

Return type:

self

fit_known(data, known_data=None, known_labels=None)[source]

Fit to the applied data with priors.

score_samples(samples)[source]

Compute scores for given samples.

Parameters:

samples – 2-d array with features.

Return type:

1-d array with scores.

coniferest.label module

class coniferest.label.Label(*values)[source]

Bases: IntEnum

Anomalous classification labels.

Three types of labels:

  • -1 for anomalies, referenced either as Label.ANOMALY or as Label.A,

  • 0 for unknowns: Label.UNKNOWN or Label.U,

  • 1 for regular data: Label.REGULAR or Label.R.

A = -1
ANOMALY = -1
R = 1
REGULAR = 1
U = 0
UNKNOWN = 0
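
Because Label is an IntEnum, its members compare equal to plain integers and mix freely with integer arrays:

    import numpy as np

    from coniferest.label import Label

    labels = np.array([Label.A, Label.U, Label.R])  # same as [-1, 0, 1]
    assert Label.ANOMALY == Label.A == -1
    assert labels[1] == Label.UNKNOWN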

coniferest.limeforest module

class coniferest.limeforest.LimeEvaluator(pine_forest)[source]

Bases: ForestEvaluator

classmethod extract_selectors(pine)[source]
class coniferest.limeforest.RandomLime(features, selectors, values)[source]

Bases: object

paths(x)[source]
class coniferest.limeforest.RandomLimeForest(trees=100, subsamples=256, depth=None, seed=0)[source]

Bases: object

fit(data)[source]
mean_paths(data)[source]
scores(data)[source]
class coniferest.limeforest.RandomLimeGenerator(sample, depth, seed=0)[source]

Bases: object

coniferest.pineforest module

class coniferest.pineforest.PineForest(n_trees=100, n_subsamples=256, max_depth=None, n_spare_trees=400, regenerate_trees=False, weight_ratio=1.0, n_jobs=-1, random_seed=None, sampletrees_per_batch=1048576)[source]

Bases: Coniferest

Pine Forest for active anomaly detection.

Pine Forests are filtering isolation forests: a simple way of incorporating prior knowledge about which objects are anomalous and which are not.

The standard fit procedure with two parameters works exactly like the isolation forest's. The behaviour changes when the additional labels parameter is supplied: fit then generates not just n_trees but n_trees + n_spare_trees trees, and filters n_spare_trees of them out, keeping the n_trees trees that deliver the best scores on the data known to be anomalous (see the usage sketch after the parameter list).

Parameters:
  • n_trees (int, optional) – Number of trees to keep for estimating anomaly scores.

  • n_subsamples (int, optional) – How many subsamples should be used to build every tree.

  • max_depth (int or None, optional) – Maximum depth of every tree. If None, log2(n_subsamples) is used.

  • n_spare_trees (int, optional) – Number of trees to generate additionally for further filtering.

  • regenerate_trees (bool, optional) – Whether to throw out all the trees during retraining, or to mix old trees with fresh ones. False by default, so we mix.

  • weight_ratio (float, optional) – Relative weight of false positives with respect to true positives, used during the filtering process.

  • n_jobs (int, default=-1) – Number of threads to use for scoring. If -1, use all available CPUs.

  • random_seed (int or None, optional) – Random seed to use for reproducibility. If None, a random seed is used.
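
A sketch of the active-learning loop PineForest is designed for (the data, the expert_verdict stand-in, and the assumption that lower scores mean more anomalous are illustrative):

    import numpy as np

    from coniferest.label import Label
    from coniferest.pineforest import PineForest


    def expert_verdict(index):
        """Hypothetical stand-in for a human expert inspecting one object."""
        return Label.REGULAR


    rng = np.random.default_rng(0)
    data = rng.normal(size=(1000, 2))
    labels = np.full(data.shape[0], Label.UNKNOWN)

    forest = PineForest(n_trees=100, n_spare_trees=400, random_seed=42)

    # Each round refits the forest, filtering the spare trees against the
    # labels accumulated so far, then asks the expert about the most
    # suspicious still-unlabeled object.
    for _ in range(5):
        forest.fit(data, labels)
        scores = forest.score_samples(data)
        unlabeled = np.flatnonzero(labels == Label.UNKNOWN)
        candidate = unlabeled[np.argmin(scores[unlabeled])]
        labels[candidate] = expert_verdict(candidate)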

apply(x)[source]

Apply the forest to X, return leaf indices.

Parameters:

x (ndarray shape (n_samples, n_features)) – 2-d array with features.

Returns:

x_leafs – For each datapoint x in X and for each tree in the forest, return the index of the leaf x ends up in.

Return type:

ndarray of shape (n_samples, n_estimators)

feature_importance(x)[source]
feature_signature(x)[source]
filter_trees(trees, data, labels, n_filter, weight_ratio=1)[source]

Filter the trees out.

Parameters:
  • trees – Trees to filter.

  • n_filter – Number of trees to filter out.

  • data – The labeled objects themselves.

  • labels – The labels of the objects. -1 is anomaly, 1 is not anomaly, 0 is uninformative.

  • weight_ratio – Weight of false positives relative to false negatives. Defaults to 1.

fit(data, labels=None)[source]

Build the trees with the given data.

Parameters:
  • data – Array with feature values of objects.

  • labels – Optional. Labels of objects. May be regular, anomalous or unknown. See Label data for details.

Return type:

self

fit_known(data, known_data=None, known_labels=None)[source]

The same as fit, but with a slightly different API. Known data and labels are kept separate from the training data for time and space efficiency: known_data is usually much smaller than data, in which case it is not reasonable to hold labels for the whole dataset.

Parameters:
  • data – Training data (array with feature values) to build trees with.

  • known_data – Feature values of known data.

  • known_labels – Labels of known data.

Return type:

self

score_samples(samples)[source]

Compute scores for the supplied data.

Parameters:

samples – Feature values to compute scores on.

Return type:

Array with computed scores.

coniferest.utils module

coniferest.utils.average_path_length(n)[source]

Average path length computation.

Parameters:

n – Either an array of tree depths to compute the average path length of, or a single tree-depth scalar.

Return type:

Average path length.
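
A usage sketch (the inputs are arbitrary):

    import numpy as np

    from coniferest.utils import average_path_length

    # Accepts either a scalar or an array of depths, per the signature above.
    print(average_path_length(256))
    print(average_path_length(np.array([2, 16, 256])))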