{ "cells": [ { "cell_type": "markdown", "id": "a28b85f0-99c1-4a06-a962-403a0ebc0292", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "fe80b124b7d573c4", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "source": [ "# US baby names dataset\n", "\n", "This notebook gives an example of Active Anomaly Detection with `coniferest` and the [US baby names](https://www.ssa.gov/OACT/babynames/) dataset.\n", "\n", "Developers of `coniferest`:\n", "- [Matwey Kornilov (MSU)](https://matwey.name)\n", "- [Vladimir Korolev](https://www.linkedin.com/in/vladimir-korolev-a4195b86/)\n", "- [Konstantin Malanchev (LINCC Frameworks / CMU)](https://homb.it), notebook author\n", "\n", "The tutorial is co-authored by [Etienne Russeil (LPC)](https://github.com/erusseil)" ] }, { "cell_type": "markdown", "id": "9540d709-09ca-41ed-bceb-92370a82bd37", "metadata": {}, "source": [ "**[Run this notebook in Google Colab](https://colab.research.google.com/github/snad-space/coniferest/blob/master/docs/notebooks/us-names.ipynb)**" ] }, { "cell_type": "markdown", "id": "b3cef2a4-3270-4f94-b37c-bd8ada31a53c", "metadata": {}, "source": [ "## Install and import modules" ] }, { "cell_type": "code", "id": "826cf8dd-003a-42e3-83df-c19ded1d2b83", "metadata": { "ExecuteTime": { "end_time": "2025-10-17T17:49:41.831606Z", "start_time": "2025-10-17T17:49:41.068336Z" } }, "source": "%pip install 'coniferest[datasets]' pandas", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: pandas in /Users/hombit/.virtualenvs/coniferest/lib/python3.12/site-packages (2.3.0)\r\n", "Requirement already satisfied: coniferest[datasets] in /Users/hombit/projects/supernovaAD/coniferest (0.0.11)\r\n", "\u001B[33mWARNING: coniferest 0.0.11 does not provide the extra 'datasets'\u001B[0m\u001B[33m\r\n", "\u001B[0mRequirement already satisfied: click in /Users/hombit/.virtualenvs/coniferest/lib/python3.12/site-packages (from coniferest[datasets]) 
(8.2.1)\r\n", "Requirement already satisfied: joblib in /Users/hombit/.virtualenvs/coniferest/lib/python3.12/site-packages (from coniferest[datasets]) (1.5.1)\r\n", "Requirement already satisfied: numpy in /Users/hombit/.virtualenvs/coniferest/lib/python3.12/site-packages (from coniferest[datasets]) (2.3.1)\r\n", "Requirement already satisfied: scikit-learn<2,>=1 in /Users/hombit/.virtualenvs/coniferest/lib/python3.12/site-packages (from coniferest[datasets]) (1.7.0)\r\n", "Requirement already satisfied: matplotlib in /Users/hombit/.virtualenvs/coniferest/lib/python3.12/site-packages (from coniferest[datasets]) (3.10.3)\r\n", "Requirement already satisfied: onnxconverter-common in /Users/hombit/.virtualenvs/coniferest/lib/python3.12/site-packages (from coniferest[datasets]) (1.14.0)\r\n", "Requirement already satisfied: scipy>=1.8.0 in /Users/hombit/.virtualenvs/coniferest/lib/python3.12/site-packages (from scikit-learn<2,>=1->coniferest[datasets]) (1.16.0)\r\n", "Requirement already satisfied: threadpoolctl>=3.1.0 in /Users/hombit/.virtualenvs/coniferest/lib/python3.12/site-packages (from scikit-learn<2,>=1->coniferest[datasets]) (3.6.0)\r\n", "Requirement already satisfied: python-dateutil>=2.8.2 in /Users/hombit/.virtualenvs/coniferest/lib/python3.12/site-packages (from pandas) (2.9.0.post0)\r\n", "Requirement already satisfied: pytz>=2020.1 in /Users/hombit/.virtualenvs/coniferest/lib/python3.12/site-packages (from pandas) (2025.2)\r\n", "Requirement already satisfied: tzdata>=2022.7 in /Users/hombit/.virtualenvs/coniferest/lib/python3.12/site-packages (from pandas) (2025.2)\r\n", "Requirement already satisfied: six>=1.5 in /Users/hombit/.virtualenvs/coniferest/lib/python3.12/site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)\r\n", "Requirement already satisfied: contourpy>=1.0.1 in /Users/hombit/.virtualenvs/coniferest/lib/python3.12/site-packages (from matplotlib->coniferest[datasets]) (1.3.2)\r\n", "Requirement already satisfied: cycler>=0.10 in 
/Users/hombit/.virtualenvs/coniferest/lib/python3.12/site-packages (from matplotlib->coniferest[datasets]) (0.12.1)\r\n", "Requirement already satisfied: fonttools>=4.22.0 in /Users/hombit/.virtualenvs/coniferest/lib/python3.12/site-packages (from matplotlib->coniferest[datasets]) (4.58.4)\r\n", "Requirement already satisfied: kiwisolver>=1.3.1 in /Users/hombit/.virtualenvs/coniferest/lib/python3.12/site-packages (from matplotlib->coniferest[datasets]) (1.4.8)\r\n", "Requirement already satisfied: packaging>=20.0 in /Users/hombit/.virtualenvs/coniferest/lib/python3.12/site-packages (from matplotlib->coniferest[datasets]) (25.0)\r\n", "Requirement already satisfied: pillow>=8 in /Users/hombit/.virtualenvs/coniferest/lib/python3.12/site-packages (from matplotlib->coniferest[datasets]) (11.2.1)\r\n", "Requirement already satisfied: pyparsing>=2.3.1 in /Users/hombit/.virtualenvs/coniferest/lib/python3.12/site-packages (from matplotlib->coniferest[datasets]) (3.2.3)\r\n", "Requirement already satisfied: onnx in /Users/hombit/.virtualenvs/coniferest/lib/python3.12/site-packages (from onnxconverter-common->coniferest[datasets]) (1.17.0)\r\n", "Requirement already satisfied: protobuf==3.20.2 in /Users/hombit/.virtualenvs/coniferest/lib/python3.12/site-packages (from onnxconverter-common->coniferest[datasets]) (3.20.2)\r\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "execution_count": 1 }, { "cell_type": "code", "id": "1bbe8848", "metadata": { "ExecuteTime": { "end_time": "2025-10-17T17:49:43.475648Z", "start_time": "2025-10-17T17:49:41.842695Z" } }, "source": [ "import datasets\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import requests\n", "from coniferest.isoforest import IsolationForest\n", "from coniferest.pineforest import PineForest\n", "from coniferest.session import Session\n", "from coniferest.session.callback import TerminateAfter, 
prompt_decision_callback, Label" ], "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/hombit/.virtualenvs/coniferest/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] } ], "execution_count": 2 }, { "cell_type": "markdown", "id": "f85849cf-ae6b-4422-a7b6-39defe4400b6", "metadata": {}, "source": [ "## Data preparation" ] }, { "cell_type": "markdown", "id": "3010d52e-2e26-494a-90e2-d8c552f39dd9", "metadata": {}, "source": [ "Download the data and load it into a single data frame" ] }, { "cell_type": "code", "id": "ce14ad6e-838b-4ea7-a97b-b90c2478b228", "metadata": { "ExecuteTime": { "end_time": "2025-10-17T17:49:49.172101Z", "start_time": "2025-10-17T17:49:43.522024Z" } }, "source": [ "%%time\n", "\n", "# Hugging Face dataset constructed from\n", "# https://www.ssa.gov/OACT/babynames/state/namesbystate.zip\n", "dataset = datasets.load_dataset(\"snad-space/us-names-by-state\")\n", "raw = dataset['train'].to_pandas()\n", "\n", "raw" ], "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Downloading data: 100%|██████████| 51/51 [00:01<00:00, 41.41files/s]\n", "Generating train split: 100%|██████████| 6600640/6600640 [00:02<00:00, 2268886.44 examples/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.14 s, sys: 970 ms, total: 4.1 s\n", "Wall time: 5.64 s\n" ] }, { "data": { "text/plain": [ " State Gender Year Name Count\n", "0 AK F 1910 Mary 14\n", "1 AK F 1910 Annie 12\n", "2 AK F 1910 Anna 10\n", "3 AK F 1910 Margaret 8\n", "4 AK F 1910 Helen 7\n", "... ... ... ... ... ...\n", "6600635 WY M 2024 Royce 5\n", "6600636 WY M 2024 Spencer 5\n", "6600637 WY M 2024 Truett 5\n", "6600638 WY M 2024 Wylder 5\n", "6600639 WY M 2024 Xander 5\n", "\n", "[6600640 rows x 5 columns]" ], "text/html": [ "
| | State | Gender | Year | Name | Count |
|---|---|---|---|---|---|
| 0 | AK | F | 1910 | Mary | 14 |
| 1 | AK | F | 1910 | Annie | 12 |
| 2 | AK | F | 1910 | Anna | 10 |
| 3 | AK | F | 1910 | Margaret | 8 |
| 4 | AK | F | 1910 | Helen | 7 |
| ... | ... | ... | ... | ... | ... |
| 6600635 | WY | M | 2024 | Royce | 5 |
| 6600636 | WY | M | 2024 | Spencer | 5 |
| 6600637 | WY | M | 2024 | Truett | 5 |
| 6600638 | WY | M | 2024 | Wylder | 5 |
| 6600639 | WY | M | 2024 | Xander | 5 |
6600640 rows × 5 columns
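
The raw table above is in long format, with one row per state, gender, year and name. A minimal sketch of one way to turn such a table into one row per name with `year_*` columns is shown here. It is not taken from the notebook: the tiny `raw_sample` frame, its values, and the choice to scale each name by its own peak year are illustrative assumptions.

```python
import pandas as pd

# Made-up stand-in for the raw table, with the same five columns.
raw_sample = pd.DataFrame(
    {
        "State": ["AK", "AK", "WY", "WY", "AK"],
        "Gender": ["F", "F", "F", "M", "F"],
        "Year": [1910, 1911, 1910, 1911, 1911],
        "Name": ["Mary", "Mary", "Mary", "Royce", "Anna"],
        "Count": [14, 10, 6, 5, 10],
    }
)

# Sum counts over states and genders, then spread years into columns.
# Name/year pairs that never occur are filled with zero.
counts = raw_sample.pivot_table(
    index="Name", columns="Year", values="Count", aggfunc="sum", fill_value=0
)
counts.columns = [f"year_{year}" for year in counts.columns]

# Assumed normalization: divide each row by its maximum,
# so that every name has the value 1 in its peak year.
features = counts.div(counts.max(axis=1), axis=0)

print(features)
```

With this scaling every value lies between 0 and 1, which keeps very common and very rare names on a comparable scale.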
| | year_1910 | year_1911 | year_1912 | year_1913 | year_1914 | year_1915 | year_1916 | year_1917 | year_1918 | year_1919 | ... | freq_10 | freq_11 | freq_12 | freq_13 | freq_14 | freq_15 | freq_16 | freq_17 | freq_18 | freq_19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Aaliyah | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.007815 | 0.011471 | 0.009205 | 0.002661 | 0.000384 | 0.003367 | 0.004686 | 0.002152 | 0.000062 | 0.000980 |
| Aaron | 0.045721 | 0.053748 | 0.062029 | 0.077329 | 0.074982 | 0.065564 | 0.066903 | 0.066013 | 0.065492 | 0.066256 | ... | 0.000203 | 0.000332 | 0.000003 | 0.000209 | 0.000003 | 0.000213 | 0.000325 | 0.000083 | 0.000084 | 0.000259 |
| Abbey | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.003283 | 0.002742 | 0.000650 | 0.000279 | 0.000762 | 0.000680 | 0.000231 | 0.000050 | 0.000016 | 0.000096 |
| Abbie | 0.359963 | 0.269810 | 0.306465 | 0.464630 | 0.215549 | 0.302426 | 0.349251 | 0.307615 | 0.197811 | 0.346344 | ... | 0.000792 | 0.000478 | 0.000180 | 0.000260 | 0.001956 | 0.000199 | 0.000055 | 0.000111 | 0.000089 | 0.000495 |
| Abbigail | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.002717 | 0.001827 | 0.001172 | 0.001199 | 0.001423 | 0.001123 | 0.000468 | 0.000081 | 0.000061 | 0.000128 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Zelma | 0.932474 | 1.000000 | 0.729411 | 0.723821 | 0.652394 | 0.669621 | 0.600106 | 0.576397 | 0.536196 | 0.576366 | ... | 0.012596 | 0.010955 | 0.009435 | 0.007556 | 0.005983 | 0.005520 | 0.005827 | 0.005979 | 0.005959 | 0.006366 |
| Zion | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.022560 | 0.009141 | 0.002182 | 0.006346 | 0.010988 | 0.008036 | 0.002212 | 0.001621 | 0.005530 | 0.006818 |
| Zoe | 0.009245 | 0.000000 | 0.003225 | 0.000000 | 0.001845 | 0.002542 | 0.004275 | 0.004373 | 0.002134 | 0.001221 | ... | 0.010147 | 0.009064 | 0.004659 | 0.001311 | 0.001038 | 0.002035 | 0.002939 | 0.003169 | 0.002656 | 0.002130 |
| Zoey | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.010202 | 0.010542 | 0.008716 | 0.005578 | 0.002431 | 0.000437 | 0.000107 | 0.000786 | 0.001459 | 0.001728 |
| Zuri | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.041216 | 0.032491 | 0.029391 | 0.026944 | 0.022914 | 0.018828 | 0.013810 | 0.008968 | 0.007011 | 0.007291 |
2337 rows × 135 columns
\n", "