CLI Usage

Quick Start

# Fast data quality assessment (completeness, duplicates, outliers)
aidrin data-quality /path/to/sample_dataset.csv

# List all available metrics
aidrin list

# Run a single metric
aidrin run completeness /path/to/sample_dataset.csv

# Run a batch of metrics from a YAML config
aidrin batch /path/to/my_project/batch_config.yaml

Sample Dataset

All examples on this page use a single CSV file. You can also find ready-to-use sample datasets in examples/sample_data/ in the repository.

To generate the same synthetic dataset yourself, run the following Python snippet once, replacing /path/to/sample_dataset.csv with the path where you want to save it, and use that path in the commands on this page:

import pandas as pd

data = {
    "age":          [34, 28, 42, 31, 25, 34, 38, 45, 29, 33],
    "income":       [75000, 45000, 95000, 55000, 35000, 75000, 85000, 950000, 42000, None],
    "credit_score": [720, 650, 780, 690, 600, 720, 750, 800, 620, 700],
    "education":    ["bachelor","high_school","master","bachelor","high_school",
                     "bachelor","bachelor","master","bachelor","bachelor"],
    "gender":       ["male","female","male","female","male","male","female","male","female","male"],
    "ethnicity":    ["white","hispanic","asian","black","white","white","asian","white","hispanic","black"],
    "zipcode":      [43201, 43201, 43202, 43202, 43203, 43201, 43203, 43204, 43204, 43205],
    "diagnosis":    ["diabetes","hypertension","diabetes","hypertension","asthma",
                     "diabetes","hypertension","asthma","diabetes","hypertension"],
    "approved":     [1, 0, 1, 1, 0, 1, 1, 1, 0, 0],
}
pd.DataFrame(data).to_csv("/path/to/sample_dataset.csv", index=False)

The dataset intentionally includes one duplicate row (rows 1 and 6), one missing income value (row 10), and one income outlier (row 8) so that completeness, duplicity, and outlier metrics return non-trivial results.
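
As a sanity check, the three planted issues can be confirmed without AIDRIN. The sketch below is not how AIDRIN computes its metrics: it assumes a plain 1.5 × IQR rule for outliers and, for brevity, uses only the three numeric columns (which is enough to identify the duplicate pair).

```python
import statistics

# (age, income, credit_score) for the ten rows of the sample dataset above.
rows = [
    (34, 75000, 720), (28, 45000, 650), (42, 95000, 780), (31, 55000, 690),
    (25, 35000, 600), (34, 75000, 720), (38, 85000, 750), (45, 950000, 800),
    (29, 42000, 620), (33, None, 700),
]

# Completeness: share of non-missing income values.
income = [row[1] for row in rows]
present = [value for value in income if value is not None]
completeness = len(present) / len(income)

# Duplicates: 1-based numbers of rows that repeat an earlier row exactly.
seen, duplicates = set(), []
for number, row in enumerate(rows, start=1):
    if row in seen:
        duplicates.append(number)
    seen.add(row)

# Outliers: values outside the 1.5 * IQR fences (an assumed rule).
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
outliers = [v for v in present if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print(completeness, duplicates, outliers)  # 0.9 [6] [950000]
```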


Commands

aidrin list

Lists all available metrics grouped by category.

aidrin list

# Filter by category
aidrin list --category data-quality

aidrin data-quality

Runs the three core data quality metrics (completeness, duplicity, and outliers) in one shot and prints a compact summary.

aidrin data-quality /path/to/sample_dataset.csv

# Output full per-feature JSON instead of summary
aidrin data-quality /path/to/sample_dataset.csv --detail

aidrin run

Runs a single metric. Use aidrin run <metric> -h to see required arguments for that metric.

# General form
aidrin run <metric-name> /path/to/sample_dataset.csv [metric-specific args]

# Shortcut: omit the "run" subcommand
aidrin <metric-name> /path/to/sample_dataset.csv [metric-specific args]

Examples:

# Data quality (no extra args needed)
aidrin run completeness /path/to/sample_dataset.csv
aidrin run duplicity /path/to/sample_dataset.csv
aidrin run outliers /path/to/sample_dataset.csv

# Impact on AI
aidrin run correlations /path/to/sample_dataset.csv "age,income,credit_score"
aidrin run feature-relevance /path/to/sample_dataset.csv "gender,education" "age,income,credit_score" approved

# Fairness & bias
aidrin run class-imbalance /path/to/sample_dataset.csv approved
aidrin run statistical-rates /path/to/sample_dataset.csv approved gender
aidrin run representation-rate /path/to/sample_dataset.csv "gender,ethnicity"

# Data governance / privacy
aidrin run k-anonymity /path/to/sample_dataset.csv "age,zipcode,gender"
aidrin run l-diversity /path/to/sample_dataset.csv "age,zipcode" diagnosis
aidrin run t-closeness /path/to/sample_dataset.csv "age,zipcode" diagnosis
aidrin run entropy-risk /path/to/sample_dataset.csv "age,zipcode,gender"
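
When these commands are driven from a script, it can help to build the argument list in one place. The following is a minimal sketch; build_run_command and columns are hypothetical helpers, not part of AIDRIN, and they only reproduce the command shape shown in the examples above.

```python
import shlex

def build_run_command(metric, dataset, *metric_args):
    """Return the argv list for `aidrin run <metric> <dataset> [args...]`."""
    return ["aidrin", "run", metric, dataset, *metric_args]

def columns(*names):
    """Join column names into the comma-separated form used in the examples."""
    return ",".join(names)

dataset = "/path/to/sample_dataset.csv"
commands = [
    build_run_command("completeness", dataset),
    build_run_command("correlations", dataset, columns("age", "income", "credit_score")),
    build_run_command("l-diversity", dataset, columns("age", "zipcode"), "diagnosis"),
]

for argv in commands:
    # shlex.join quotes arguments where needed so the line is shell-safe.
    print(shlex.join(argv))
```

With AIDRIN installed, each list could be passed to subprocess.run(argv, check=True) instead of being printed.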

Options available on all run subcommands:

Flag            Description
--------------  ------------------------------------------
-v, --verbose   Show progress output while the metric runs

aidrin batch

Runs a set of metrics defined in a JSON or YAML config file. Useful for reproducible pipelines.

aidrin batch /path/to/my_project/batch_config.yaml
aidrin batch /path/to/my_project/batch_config.yaml -v          # verbose

Results are printed as JSON to stdout. Redirect to a file to save:

aidrin batch /path/to/my_project/batch_config.yaml > results.json
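
Because the results go to stdout as JSON, they can also be captured directly from a Python script instead of being redirected to a file. The sketch below assumes stdout contains nothing but the JSON document (for example, that no progress output is mixed into it); run_and_parse is a hypothetical helper.

```python
import json
import subprocess
import sys

def run_and_parse(argv):
    """Run a command, forward its stderr, and parse its stdout as JSON."""
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    if completed.stderr:
        sys.stderr.write(completed.stderr)
    return json.loads(completed.stdout)

# Example call (requires aidrin on PATH, so it is left commented out):
# results = run_and_parse(["aidrin", "batch", "/path/to/my_project/batch_config.yaml"])
```

check=True makes a non-zero exit status raise subprocess.CalledProcessError, so a failed batch run is not silently parsed as empty output.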

Config file format (YAML):

file-path: /path/to/sample_dataset.csv
file-type: .csv

metrics:
  - completeness
  - duplicity
  - outliers
  - class-imbalance

target-column: approved
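
Since the batch command also accepts JSON, the same config can be generated from Python with the standard library. This sketch assumes the JSON form uses exactly the same keys as the YAML above, which this page does not confirm; write_config is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path

# Same settings as the YAML config above, expressed as a dict.
config = {
    "file-path": "/path/to/sample_dataset.csv",
    "file-type": ".csv",
    "metrics": ["completeness", "duplicity", "outliers", "class-imbalance"],
    "target-column": "approved",
}

def write_config(config, path):
    """Serialise the config dict as indented JSON and return the path."""
    path = Path(path)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return path

# Written to a temporary directory here; use your project path in practice.
out = write_config(config, Path(tempfile.mkdtemp()) / "batch_config.json")
print(out.read_text(encoding="utf-8"))
```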

Example — fairness analysis on the sample dataset:

# /path/to/my_project/fairness_config.yaml

file-path: /path/to/sample_dataset.csv
file-type: .csv

metrics:
  - statistical-rates
  - representation-rate
  - class-imbalance

target-column: approved
sensitive-attribute-column: ethnicity
columns:
  - gender
  - ethnicity

aidrin batch /path/to/my_project/fairness_config.yaml > fairness_results.json

aidrin add-custom-module

Scaffolds a new custom metric module in a directory of your choice. Custom metrics live entirely outside the AIDRIN package — you own the file.

aidrin add-custom-module my_audit --dir /path/to/my_project

This creates /path/to/my_project/my_audit.py with a metric() and a remedy() method. Edit those methods to add your logic, then run by passing the file path directly:

aidrin run custom /path/to/my_project/my_audit.py /path/to/sample_dataset.csv metric    # run the metric
aidrin run custom /path/to/my_project/my_audit.py /path/to/sample_dataset.csv remedy    # run the remedy

The remedy output CSV is saved to a remedy_data/ folder next to the module file (/path/to/my_project/remedy_data/my_audit_remedy.csv).
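
The scaffold's exact method signatures are not shown on this page, so the sketch below only illustrates the kind of logic a metric/remedy pair might hold. It is written as two standalone, hypothetical functions over CSV rows; adapt the bodies to whatever signatures the generated my_audit.py actually uses.

```python
import csv
import io

def missing_ratio(rows, column):
    """Metric sketch: fraction of rows with an empty value in `column`."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if not (row.get(column) or "").strip())
    return missing / len(rows)

def drop_missing(rows, column):
    """Remedy sketch: keep only rows that have a value in `column`."""
    return [row for row in rows if (row.get(column) or "").strip()]

# Three rows in the shape of the sample dataset; the last income is blank.
sample = io.StringIO("age,income\n34,75000\n28,45000\n33,\n")
rows = list(csv.DictReader(sample))

print(missing_ratio(rows, "income"))      # one of three rows is missing
print(len(drop_missing(rows, "income")))  # two rows remain
```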


Available Metrics

Category         Metric                   Required Args
---------------  -----------------------  -----------------------------------------------------
Data Quality     completeness             (none)
Data Quality     duplicity                (none)
Data Quality     outliers                 (none)
Impact on AI     correlations             columns
Impact on AI     feature-relevance        categorical-columns, numerical-columns, target-column
Fairness & Bias  class-imbalance          target-column
Fairness & Bias  statistical-rates        target-column, sensitive-attribute-column
Fairness & Bias  representation-rate      columns
Data Governance  k-anonymity              quasi-identifiers
Data Governance  l-diversity              quasi-identifiers, sensitive-column
Data Governance  t-closeness              quasi-identifiers, sensitive-column
Data Governance  entropy-risk             quasi-identifiers
Data Governance  single-attribute-risk    id-column, eval-columns
Data Governance  multiple-attribute-risk  id-column, eval-columns
Custom           custom                   <name-or-path>, varies; see aidrin run custom -h

Metric and category names accept either dashes or underscores interchangeably (e.g. class-imbalance and class_imbalance are equivalent).


Using AIDRIN as a Python Library

All CLI metrics are also available as a Python API for use in notebooks or scripts:

from aidrin.headless import run_metric, run_data_quality, run_batch_metrics
from aidrin.headless import HeadlessConfig

# Single metric
result = run_metric("completeness", "/path/to/sample_dataset.csv")

# Fast data quality bundle
result = run_data_quality("/path/to/sample_dataset.csv")

# Batch from config
config = HeadlessConfig.from_file("/path/to/my_project/batch_config.yaml")
result = run_batch_metrics(config)

For the web interface’s lower-level functional API, see the Web Application Usage page.


Agentic Evaluation

The agentic evaluation component extends the AIDRIN CLI with a question-answering layer for domain-aware data readiness assessment. Where the aidrin CLI runs quantitative, metric-driven evaluations, the agentic component lets you pose natural-language questions about your data against a body of domain literature — papers, regulatory documents, standards — and receive evidence-backed answers along with actionable remediation recommendations.

Note

The agentic component is an optional extra (aidrin[agentic]) because it requires LLM API access and additional dependencies not needed for standard CLI or web interface use. See CLI Installation for installation instructions.

How It Works

Each question is processed through a five-stage pipeline:

  1. Data Profiler — loads the dataset and computes compact summary statistics (row/column counts, means, missing-value ratios, top categories) to give the LLM structural context about the data.

  2. Vector Retriever — searches a pre-built FAISS vector index of your domain literature (PDFs, text files) to retrieve the most relevant passages for each question. When retrieval is disabled, the LLM answers from its own knowledge and the dataset profile alone.

  3. Code Executor — uses the retrieved passages and dataset profile to prompt an LLM to write executable Python/pandas code, then runs that code directly against the dataset. A self-healing loop automatically repairs failing code, up to a configurable number of attempts (executor.max_attempts).

  4. Complexity Scorer — classifies each query as easy, moderate, or hard based on three dimensions: profile dependency, domain-knowledge dependency, and code complexity.

  5. Remediation Generator — synthesises concrete, domain-grounded remediation recommendations for each finding, citing the same domain literature used during retrieval.
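
The self-healing loop in stage 3 can be pictured as a bounded retry in which each failure is fed back to the code generator. This is a minimal sketch of that idea, not AIDRIN's executor: generate_code stands in for the LLM call, the convention that generated code stores its answer in a variable named result is an assumption, and a real executor would sandbox the code rather than call exec directly.

```python
def run_with_repair(generate_code, max_attempts=5):
    """Ask for code, run it, and feed any error back until it succeeds."""
    error = None
    for attempt in range(1, max_attempts + 1):
        code = generate_code(error)  # stand-in for the LLM call
        namespace = {}
        try:
            exec(code, namespace)  # unsafe outside a sandbox; illustration only
            return namespace.get("result"), attempt
        except Exception as exc:
            # Keep a short description of the failure for the next attempt.
            error = f"{type(exc).__name__}: {exc}"
    raise RuntimeError(
        f"no working code after {max_attempts} attempts; last error: {error}"
    )

def fake_llm(previous_error):
    """Toy generator: the first draft fails, the repaired draft works."""
    if previous_error is None:
        return "result = 1 / 0"
    return "result = 10 / 2"

print(run_with_repair(fake_llm))  # (5.0, 2)
```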

Multiple questions are processed in parallel via a configurable thread pool (retrieval.max_workers). Results are printed to stdout and optionally written to a JSON file.
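
The parallel processing described above follows the usual thread-pool pattern, which suits work that mostly waits on network calls. The sketch below uses the standard library's ThreadPoolExecutor; answer is a placeholder for the per-question pipeline, and both function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def answer(question):
    """Placeholder for the per-question pipeline (retrieve, execute, score)."""
    return {"question": question, "answer": f"(answer to: {question})"}

def answer_all(questions, max_workers=4):
    """Process questions concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order even though calls run concurrently.
        return list(pool.map(answer, questions))

results = answer_all(["first question", "second question"], max_workers=2)
for item in results:
    print(item["question"], "->", item["answer"])
```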

Commands

Two subcommands are available under aidrin agentic:

# Build (or rebuild) the vector index from your domain literature
aidrin agentic build-index -c /path/to/my_project/config.yaml

# Run the full evaluation pipeline
aidrin agentic run -c /path/to/my_project/config.yaml -o /path/to/my_project/results.json

# Run without rebuilding the index (use an existing one)
aidrin agentic run -c /path/to/my_project/config.yaml --skip-vector -o results.json

All paths in config.yaml are resolved relative to the config file itself, so your project directory can live anywhere on disk.
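
The path rule above amounts to joining each configured path onto the directory that holds the config file. A minimal sketch of that rule follows; resolve_from_config is a hypothetical helper, and the treatment of absolute paths (left unchanged) is standard pathlib behaviour rather than something this page states about AIDRIN.

```python
import os
from pathlib import Path

def resolve_from_config(config_path, value):
    """Resolve a path from the config relative to the config file's directory."""
    base = Path(config_path).parent
    # Joining an absolute `value` discards `base`, so absolute paths pass through.
    return Path(os.path.normpath(base / value))

print(resolve_from_config("/opt/proj/config.yaml", "./sources"))
print(resolve_from_config("/opt/proj/config.yaml", "./data/metadata.txt"))
```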

Flag             Description
---------------  --------------------------------------------------------------------------------------
-c / --config    Path to the YAML config file (required)
-o / --output    Path to write JSON results (optional; results are always printed to stdout regardless)
--skip-vector    Skip rebuilding the vector index and use the existing one (run only)
-v / --verbose   Print vector build progress to stderr

Quickstart: UCI Power Consumption Dataset

This end-to-end example uses the UCI Individual Household Electric Power Consumption dataset — a real-world time-series dataset of ~2 million minute-level household energy readings.

Step 1: Add the dataset

The repository includes a ready-to-use example project at examples/agentic/power_consumption/. Download household_power_consumption.zip from the UCI link above, extract it, and place the .txt file in the data/ subdirectory.

Step 2: Add domain literature

Download papers that cite the dataset (published 2016–2026) from the link above and place their PDFs in examples/agentic/power_consumption/sources/.

Your example project directory should then look like this:

examples/agentic/power_consumption/
├── config.yaml          ← pre-configured, ready to use
├── loader.py            ← handles the semicolon-separated .txt format
├── data/
│   ├── metadata.txt                          ← included in repo
│   └── household_power_consumption.txt       ← add this
└── sources/
    ├── paper1.pdf                            ← add one or more PDFs here
    └── ...

Step 3: Configure your API key and endpoint

The included config.yaml is pre-configured for the standard OpenAI API. Set the OPENAI_API_KEY environment variable to the key your endpoint requires:

export OPENAI_API_KEY="sk-..."
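
If you launch the pipeline from a script, a small pre-flight check can fail early with a clear message when the key is missing. This is a hypothetical helper, not something AIDRIN provides; it only inspects the environment.

```python
import os

def require_env(name, environ=os.environ):
    """Return the value of an environment variable or raise if it is unset."""
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the pipeline")
    return value

# Typical use before starting a run:
# require_env("OPENAI_API_KEY")
```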

If you are using a different OpenAI-compatible endpoint (e.g. LBL CBORG), update llm.base_url in the config and ensure all model and embedding_model values are names available on that provider:

# examples/agentic/power_consumption/config.yaml

paths:
  data_loader: "./loader.py:load_dataset"
  metadata_csv: "./data/metadata.txt"

llm:
  base_url: "https://api.openai.com/v1"   # replace with your OpenAI-compatible endpoint, e.g. https://api.cborg.lbl.gov

profiling:
  full_summary: false

vector_store:
  sources:
    - ./sources
  embedding_model: text-embedding-ada-002
  chunk_size: 1000
  chunk_overlap: 200
  vector_store_name: power_consumption_index

retrieval:
  enabled: true
  max_workers: 4
  answer_model: gpt-4o
  top_k: 3
  question:
    - "Which European Union regulation is cited as requiring that the consequences of profiling be informed to the data subject? Return the name of the regulation as a string."
    - "Return True if more than 80% of the data is resampled to align with the widely adopted industry standards for smart meter technology to reduce behavioral noise. Return False if not."

executor:
  enabled: true
  max_attempts: 5
  model: gpt-4o
  temperature: 0.0

complexity_scorer:
  enabled: true
  model: gpt-4o

remediation:
  enabled: true
  model: gpt-4o
  context_chars: 3000

output:
  save_log: true
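
To get a feel for the vector_store.chunk_size and vector_store.chunk_overlap settings above, the sketch below splits text with a sliding window. It assumes both values are measured in characters and that consecutive chunks share chunk_overlap characters; this page does not say which splitter AIDRIN uses, so treat it as an illustration of the parameters only.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

document = "x" * 2400
pieces = chunk_text(document)
print([len(piece) for piece in pieces])  # [1000, 1000, 800]
```

Larger overlaps reduce the chance that a relevant sentence is cut in half at a chunk boundary, at the cost of indexing more text.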

Step 4: Build the vector index

Run this once. Re-run only when your literature changes:

aidrin agentic build-index -c examples/agentic/power_consumption/config.yaml

Step 5: Run the pipeline

aidrin agentic run \
  -c examples/agentic/power_consumption/config.yaml \
  -o examples/agentic/power_consumption/results.json

On subsequent runs, skip rebuilding the index with --skip-vector:

aidrin agentic run \
  -c examples/agentic/power_consumption/config.yaml \
  --skip-vector \
  -o examples/agentic/power_consumption/results.json

Results are printed to stdout and written to examples/agentic/power_consumption/results.json. Each result includes the question, the answer, the retrieved passages that informed it, the generated code (if applicable), a complexity classification, and remediation recommendations.

Using Your Own Dataset

Follow the same steps, substituting your own dataset and literature.

Step 1: Set up your project directory

~/my_project/
├── config.yaml
├── loader.py
├── data/
│   ├── my_data.csv          # your dataset
│   └── metadata.txt         # column-level descriptions (plain text or CSV)
└── sources/                 # domain literature to index (PDF or TXT)
    ├── reference.pdf
    └── standards.txt

Step 2: Define your custom data loader

Implement loader.py in your project directory with a function load_dataset that returns your dataset as a pandas.DataFrame. This gives you full control over the loading logic and lets you support any file format:

# ~/my_project/loader.py
import pandas as pd
from pathlib import Path

def load_dataset() -> pd.DataFrame:
    return pd.read_csv(Path(__file__).parent / "data/my_data.csv")

Then reference it in config.yaml via paths.data_loader (see Step 4). For datasets that require more complex loading — multiple files, Parquet, HDF5, etc. — replace the body of load_dataset with whatever logic you need; the only requirement is that it returns a pandas.DataFrame.
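
The paths.data_loader value has the form "file.py:function". One way such a reference could be resolved is with importlib, as sketched below; load_callable is a hypothetical helper and this is not necessarily how AIDRIN loads it. The demo loader returns a plain list only to keep the example dependency-free; a real load_dataset must return a pandas.DataFrame.

```python
import importlib.util
import tempfile
from pathlib import Path

def load_callable(spec, config_dir):
    """Resolve 'relative/path.py:function' against config_dir and return the function."""
    file_part, _, func_name = spec.rpartition(":")
    if not file_part or not func_name:
        raise ValueError(f"expected 'path.py:function', got {spec!r}")
    module_path = (Path(config_dir) / file_part).resolve()
    module_spec = importlib.util.spec_from_file_location(module_path.stem, module_path)
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)  # executes loader.py
    return getattr(module, func_name)

# Demo with a throwaway loader written to a temporary project directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "loader.py").write_text(
        "def load_dataset():\n    return ['row1', 'row2']\n", encoding="utf-8"
    )
    loader = load_callable("./loader.py:load_dataset", tmp)
    loaded = loader()

print(loaded)  # ['row1', 'row2']
```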

Step 3: Add domain literature

Place your domain literature (PDFs, text files) in sources/ and your dataset in data/.

Step 4: Configure your API key and write a config file

Set the OPENAI_API_KEY environment variable to the key your endpoint requires:

export OPENAI_API_KEY="sk-..."

Then create config.yaml. If you are using a different OpenAI-compatible endpoint (e.g. LBL CBORG), update llm.base_url to point to it and ensure all model and embedding_model values are names available on that provider:

# ~/my_project/config.yaml

llm:
  base_url: "https://api.openai.com/v1"   # replace with your OpenAI-compatible endpoint, e.g. https://api.cborg.lbl.gov

paths:
  data_loader: "./loader.py:load_dataset"   # module:function relative to config dir
  metadata_csv: "./data/metadata.txt"

profiling:
  full_summary: false

vector_store:
  sources:
    - ./sources
  embedding_model: text-embedding-ada-002
  chunk_size: 1000
  chunk_overlap: 200
  vector_store_name: my_project_index # name for the FAISS index that will be created

retrieval:
  enabled: true
  max_workers: 8
  answer_model: gpt-5.2
  top_k: 3
  question:
    - "Does the age feature satisfy the HIPAA Safe Harbor de-identification standard?"

executor:
  enabled: true
  max_attempts: 5
  model: gpt-5.2
  temperature: 0.0

complexity_scorer:
  enabled: true
  model: gpt-5.2

remediation:
  enabled: true
  model: gpt-5.2
  context_chars: 3000
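
The retrieval.top_k setting in the config above presumably controls how many passages are retrieved per question. Conceptually, retrieval ranks indexed passages by similarity to the question's embedding and keeps the best k. The sketch below does this by brute force with cosine similarity on toy two-dimensional vectors; a FAISS index serves the same purpose far more efficiently, and the similarity measure AIDRIN's index uses is not stated on this page.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, passages, k=3):
    """Return the texts of the k passages most similar to query_vec."""
    ranked = sorted(passages, key=lambda p: cosine(query_vec, p[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy (text, embedding) pairs; real embeddings have far more dimensions.
passages = [
    ("passage about billing", [0.0, 1.0]),
    ("passage about ages", [0.9, 0.1]),
    ("passage about zip codes", [0.5, 0.5]),
]

print(top_k([1.0, 0.0], passages, k=2))
# ['passage about ages', 'passage about zip codes']
```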

Step 5: Build the index and run

aidrin agentic build-index -c ~/my_project/config.yaml
aidrin agentic run -c ~/my_project/config.yaml -o ~/my_project/results.json