CLI Usage
Quick Start
# Fast data quality assessment (completeness, duplicates, outliers)
aidrin data-quality /path/to/sample_dataset.csv
# List all available metrics
aidrin list
# Run a single metric
aidrin run completeness /path/to/sample_dataset.csv
# Run a batch of metrics from a YAML config
aidrin batch /path/to/my_project/batch_config.yaml
Sample Dataset
All examples on this page use a single CSV file. You can also find ready-to-use sample
datasets in examples/sample_data/ in the repository.
To generate the synthetic dataset yourself, run the following Python snippet
once, then substitute /path/to/sample_dataset.csv with the
actual path where you saved it:
import pandas as pd

data = {
    "age": [34, 28, 42, 31, 25, 34, 38, 45, 29, 33],
    "income": [75000, 45000, 95000, 55000, 35000, 75000, 85000, 950000, 42000, None],
    "credit_score": [720, 650, 780, 690, 600, 720, 750, 800, 620, 700],
    "education": ["bachelor","high_school","master","bachelor","high_school",
                  "bachelor","bachelor","master","bachelor","bachelor"],
    "gender": ["male","female","male","female","male","male","female","male","female","male"],
    "ethnicity": ["white","hispanic","asian","black","white","white","asian","white","hispanic","black"],
    "zipcode": [43201, 43201, 43202, 43202, 43203, 43201, 43203, 43204, 43204, 43205],
    "diagnosis": ["diabetes","hypertension","diabetes","hypertension","asthma",
                  "diabetes","hypertension","asthma","diabetes","hypertension"],
    "approved": [1, 0, 1, 1, 0, 1, 1, 1, 0, 0],
}

pd.DataFrame(data).to_csv("/path/to/sample_dataset.csv", index=False)
The dataset intentionally includes one duplicated row (row 6 repeats row 1), one missing
income value (row 10), and one income outlier (row 8, 950000) so that the completeness,
duplicity, and outlier metrics return non-trivial results.
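As a quick sanity check of the three planted issues, the sketch below recomputes them with the Python standard library only. It is not how AIDRIN computes its metrics: the 1.5 * IQR fence is an assumed outlier rule chosen for illustration, and only the three numeric columns are reproduced for brevity.

```python
from collections import Counter
from statistics import quantiles

# Numeric columns of the sample dataset: (age, income, credit_score)
rows = [
    (34, 75000, 720), (28, 45000, 650), (42, 95000, 780), (31, 55000, 690),
    (25, 35000, 600), (34, 75000, 720), (38, 85000, 750), (45, 950000, 800),
    (29, 42000, 620), (33, None, 700),
]

# Rows that exactly repeat an earlier row
duplicate_rows = sum(count - 1 for count in Counter(rows).values())

# Missing income values
missing_income = sum(1 for _, income, _ in rows if income is None)

# Income outliers under an assumed 1.5 * IQR rule (illustrative only)
incomes = [income for _, income, _ in rows if income is not None]
q1, _, q3 = quantiles(incomes, n=4)
fence = 1.5 * (q3 - q1)
outliers = [x for x in incomes if x < q1 - fence or x > q3 + fence]

print(duplicate_rows, missing_income, outliers)
```

If the AIDRIN results disagree with these counts, the difference most likely comes from a different outlier definition rather than from the data.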
Commands
aidrin list
Lists all available metrics grouped by category.
aidrin list
# Filter by category
aidrin list --category data-quality
aidrin data-quality
Runs the three core data quality metrics (completeness, duplicity, and outliers) in one shot and prints a compact summary.
aidrin data-quality /path/to/sample_dataset.csv
# Output full per-feature JSON instead of summary
aidrin data-quality /path/to/sample_dataset.csv --detail
aidrin run
Runs a single metric. Use aidrin run <metric> -h to see required arguments for that metric.
# General form
aidrin run <metric-name> /path/to/sample_dataset.csv [metric-specific args]
# Shortcut: omit the "run" subcommand
aidrin <metric-name> /path/to/sample_dataset.csv [metric-specific args]
Examples:
# Data quality (no extra args needed)
aidrin run completeness /path/to/sample_dataset.csv
aidrin run duplicity /path/to/sample_dataset.csv
aidrin run outliers /path/to/sample_dataset.csv
# Impact on AI
aidrin run correlations /path/to/sample_dataset.csv "age,income,credit_score"
aidrin run feature-relevance /path/to/sample_dataset.csv "gender,education" "age,income,credit_score" approved
# Fairness & bias
aidrin run class-imbalance /path/to/sample_dataset.csv approved
aidrin run statistical-rates /path/to/sample_dataset.csv approved gender
aidrin run representation-rate /path/to/sample_dataset.csv "gender,ethnicity"
# Data governance / privacy
aidrin run k-anonymity /path/to/sample_dataset.csv "age,zipcode,gender"
aidrin run l-diversity /path/to/sample_dataset.csv "age,zipcode" diagnosis
aidrin run t-closeness /path/to/sample_dataset.csv "age,zipcode" diagnosis
aidrin run entropy-risk /path/to/sample_dataset.csv "age,zipcode,gender"
Options available on all run subcommands:
| Flag | Description |
|---|---|
| | Show progress output while the metric runs |
aidrin batch
Runs a set of metrics defined in a JSON or YAML config file. Useful for reproducible pipelines.
aidrin batch /path/to/my_project/batch_config.yaml
aidrin batch /path/to/my_project/batch_config.yaml -v # verbose
Results are printed as JSON to stdout. Redirect to a file to save:
aidrin batch /path/to/my_project/batch_config.yaml > results.json
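To consume the JSON from a script rather than through a shell redirect, a thin wrapper along these lines may be enough. This is a sketch that assumes stdout contains only the JSON document; run_batch and its runner parameter are hypothetical names introduced here, not part of AIDRIN.

```python
import json
import subprocess


def run_batch(config_path, runner=subprocess.run):
    """Run `aidrin batch <config>` and parse its stdout as JSON.

    `runner` is injectable so the wrapper can be exercised without AIDRIN installed.
    """
    proc = runner(
        ["aidrin", "batch", str(config_path)],
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return json.loads(proc.stdout)


# Usage (requires the aidrin CLI on PATH):
# results = run_batch("/path/to/my_project/batch_config.yaml")
```

Passing check=True means a failed run raises an exception instead of handing an empty string to the JSON parser.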
Config file format (YAML):
file-path: /path/to/sample_dataset.csv
file-type: .csv
metrics:
- completeness
- duplicity
- outliers
- class-imbalance
target-column: approved
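Because aidrin batch also accepts JSON, the same configuration can be generated from Python with the standard library. This sketch assumes the JSON form uses exactly the same flat keys as the YAML above; that equivalence is not confirmed here, so check it against a real run.

```python
import json

# Same keys and values as the YAML example (assumed to carry over to JSON unchanged)
config = {
    "file-path": "/path/to/sample_dataset.csv",
    "file-type": ".csv",
    "metrics": ["completeness", "duplicity", "outliers", "class-imbalance"],
    "target-column": "approved",
}

config_json = json.dumps(config, indent=2)
print(config_json)  # save this text to a .json file and pass its path to `aidrin batch`
```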
Example — fairness analysis on the sample dataset:
# /path/to/my_project/fairness_config.yaml
file-path: /path/to/sample_dataset.csv
file-type: .csv
metrics:
- statistical-rates
- representation-rate
- class-imbalance
target-column: approved
sensitive-attribute-column: ethnicity
columns:
- gender
- ethnicity
aidrin batch /path/to/my_project/fairness_config.yaml > fairness_results.json
aidrin add-custom-module
Scaffolds a new custom metric module in a directory of your choice. Custom metrics live entirely outside the AIDRIN package — you own the file.
aidrin add-custom-module my_audit --dir /path/to/my_project
This creates /path/to/my_project/my_audit.py with a metric() and a remedy() method.
Edit those methods to add your logic, then run by passing the file path directly:
aidrin run custom /path/to/my_project/my_audit.py /path/to/sample_dataset.csv metric # run the metric
aidrin run custom /path/to/my_project/my_audit.py /path/to/sample_dataset.csv remedy # run the remedy
The remedy output CSV is saved to a remedy_data/ folder next to the module file
(/path/to/my_project/remedy_data/my_audit_remedy.csv).
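To locate the remedy CSV from a script, the naming pattern in the example above can be written as a small helper. The rule is inferred from that single example, and remedy_output_path is a hypothetical name, so treat it as a sketch rather than AIDRIN's own logic.

```python
from pathlib import Path


def remedy_output_path(module_file):
    """Return <module dir>/remedy_data/<module name>_remedy.csv for a custom module file."""
    module_path = Path(module_file)
    return module_path.parent / "remedy_data" / f"{module_path.stem}_remedy.csv"


print(remedy_output_path("/path/to/my_project/my_audit.py").as_posix())
```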
Available Metrics
| Category | Metric | Required Args |
|---|---|---|
| Data Quality | | none |
| Data Quality | | none |
| Data Quality | | none |
| Impact on AI | | |
| Impact on AI | | |
| Fairness & Bias | | |
| Fairness & Bias | | |
| Fairness & Bias | | |
| Data Governance | | |
| Data Governance | | |
| Data Governance | | |
| Data Governance | | |
| Data Governance | | |
| Data Governance | | |
| Custom | | |
Metric and category names accept either dashes or underscores interchangeably
(e.g. class-imbalance and class_imbalance are equivalent).
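When comparing metric names in your own scripts, the equivalence can be mirrored with a one-line normaliser. This is only an illustration of the rule stated above, not AIDRIN's implementation; normalize_name is a hypothetical helper.

```python
def normalize_name(name: str) -> str:
    """Map underscores to dashes so both spellings compare equal."""
    return name.replace("_", "-")


print(normalize_name("class_imbalance") == normalize_name("class-imbalance"))
```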
Using AIDRIN as a Python Library
All CLI metrics are also available as a Python API for use in notebooks or scripts:
from aidrin.headless import run_metric, run_data_quality, run_batch_metrics
from aidrin.headless import HeadlessConfig
# Single metric
result = run_metric("completeness", "/path/to/sample_dataset.csv")
# Fast data quality bundle
result = run_data_quality("/path/to/sample_dataset.csv")
# Batch from config
config = HeadlessConfig.from_file("/path/to/my_project/batch_config.yaml")
result = run_batch_metrics(config)
For the web interface’s lower-level functional API, see the Web Application Usage page.
Agentic Evaluation
The agentic evaluation component extends the AIDRIN CLI with a question-answering layer for
domain-aware data readiness assessment. Where the aidrin CLI runs quantitative, metric-driven
evaluations, the agentic component lets you pose natural-language questions about your data against
a body of domain literature — papers, regulatory documents, standards — and receive evidence-backed
answers along with actionable remediation recommendations.
Note
The agentic component is an optional extra (aidrin[agentic]) because it requires LLM API
access and additional dependencies not needed for standard CLI or web interface use.
See CLI Installation for installation instructions.
How It Works
Each question is processed through a five-stage pipeline:
1. Data Profiler: loads the dataset and computes compact summary statistics (row/column counts, means, missing-value ratios, top categories) to give the LLM structural context about the data.
2. Vector Retriever: searches a pre-built FAISS vector index of your domain literature (PDFs, text files) to retrieve the most relevant passages for each question. When retrieval is disabled, the LLM answers from its own knowledge and the dataset profile alone.
3. Code Executor: uses the retrieved passages and dataset profile to prompt an LLM to write executable Python/pandas code, then runs that code directly against the dataset. A self-healing loop automatically repairs failing code, up to a configurable number of attempts (executor.max_attempts).
4. Complexity Scorer: classifies each query as easy, moderate, or hard based on three dimensions: profile dependency, domain-knowledge dependency, and code complexity.
5. Remediation Generator: synthesises concrete, domain-grounded remediation recommendations for each finding, citing the same domain literature used during retrieval.
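The self-healing loop in the Code Executor stage can be pictured roughly as follows. This is a minimal sketch, not AIDRIN's implementation: generate_code stands in for the LLM call, the convention that generated code assigns its answer to a variable named result is an assumption, and fake_llm is a toy stand-in used only to make the example runnable.

```python
import traceback


def run_with_self_healing(generate_code, question, max_attempts=5):
    """Ask for code, run it, and feed any traceback back until it succeeds.

    `generate_code(question, last_error)` is a hypothetical stand-in for the LLM call.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        code = generate_code(question, last_error)
        namespace = {}
        try:
            exec(code, namespace)  # only acceptable inside a sandbox for untrusted code
            return {"answer": namespace["result"], "attempts": attempt, "code": code}
        except Exception:
            # Keep the traceback so the next prompt can include it
            last_error = traceback.format_exc()
    raise RuntimeError(f"no working code after {max_attempts} attempts:\n{last_error}")


# Toy stand-in for the LLM: the first draft has a bug, the repair fixes it.
def fake_llm(question, last_error):
    return "result = 1 / 0" if last_error is None else "result = 42"


outcome = run_with_self_healing(fake_llm, "toy question")
print(outcome["answer"], outcome["attempts"])
```

Capping the loop at max_attempts bounds both latency and API cost when a question cannot be answered with code at all.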
Multiple questions are processed in parallel via a configurable thread pool
(retrieval.max_workers). Results are printed to stdout and optionally written to a JSON file.
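The parallel fan-out over questions might look like the sketch below, which uses the standard library's thread pool. Here answer_question is a placeholder for the whole per-question pipeline and answer_all is a hypothetical name; the real scheduling code may differ.

```python
from concurrent.futures import ThreadPoolExecutor


def answer_question(question):
    """Placeholder for the per-question pipeline (profile, retrieve, execute, score, remediate)."""
    return {"question": question, "answer": f"(answer to: {question})"}


def answer_all(questions, max_workers=4):
    # pool.map returns results in input order even though the calls run concurrently
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(answer_question, questions))


results = answer_all(["Q1", "Q2", "Q3"], max_workers=2)
print([r["question"] for r in results])
```

Threads suit this workload because each question spends most of its time waiting on network calls to the LLM endpoint.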
Commands
Two subcommands are available under aidrin agentic:
# Build (or rebuild) the vector index from your domain literature
aidrin agentic build-index -c /path/to/my_project/config.yaml
# Run the full evaluation pipeline
aidrin agentic run -c /path/to/my_project/config.yaml -o /path/to/my_project/results.json
# Run without rebuilding the index (use an existing one)
aidrin agentic run -c /path/to/my_project/config.yaml --skip-vector -o results.json
All paths in config.yaml are resolved relative to the config file itself, so your project
directory can live anywhere on disk.
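Resolving a path relative to the config file, rather than to the current working directory, can be sketched as below. Both helper names are hypothetical, and passing absolute paths through unchanged is an assumption; the loader spec format file.py:function is taken from the config examples on this page.

```python
import os
from pathlib import Path


def resolve_from_config(config_file, value):
    """Resolve a config path relative to the directory containing the config file."""
    path = Path(value)
    if path.is_absolute():
        return path  # assumption: absolute paths are used unchanged
    # normpath collapses "./" and "../" without touching the filesystem
    return Path(os.path.normpath(Path(config_file).parent / path))


def split_loader_spec(spec):
    """Split "file.py:function" into its file and function parts."""
    file_part, _, function_name = spec.rpartition(":")
    return file_part, function_name


config_file = "/path/to/my_project/config.yaml"
loader_file, loader_func = split_loader_spec("./loader.py:load_dataset")
print(resolve_from_config(config_file, loader_file).as_posix(), loader_func)
print(resolve_from_config(config_file, "./data/metadata.txt").as_posix())
```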
| Flag | Description |
|---|---|
| -c | Path to the YAML config file (required) |
| -o | Path to write JSON results (optional; results are always printed to stdout regardless) |
| --skip-vector | Skip rebuilding the vector index and use the existing one |
| | Print vector build progress to stderr |
Quickstart: UCI Power Consumption Dataset
This end-to-end example uses the UCI Individual Household Electric Power Consumption dataset — a real-world time-series dataset of ~2 million minute-level household energy readings.
Step 1: Add the dataset
The repository includes a ready-to-use example project at
examples/agentic/power_consumption/. Download
household_power_consumption.zip from the UCI link above, extract it, and place the .txt
file in the data/ subdirectory.
Step 2: Add domain literature
Download papers that cite the dataset (published 2016–2026) from the link above, and place their PDFs in
examples/agentic/power_consumption/sources/.
Your example project directory should then look like this:
examples/agentic/power_consumption/
├── config.yaml                          ← pre-configured, ready to use
├── loader.py                            ← handles the semicolon-separated .txt format
├── data/
│   ├── metadata.txt                     ← included in repo
│   └── household_power_consumption.txt  ← add this
└── sources/
    ├── paper1.pdf                       ← add one or more PDFs here
    └── ...
Step 3: Configure your API key and endpoint
The included config.yaml is pre-configured for the standard OpenAI API. Set the OPENAI_API_KEY
environment variable to the key your endpoint requires:
export OPENAI_API_KEY="sk-..."
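If you drive the pipeline from a script, it can help to fail fast when the key is missing instead of discovering it on the first LLM call. The helper below is a hypothetical convenience, not something AIDRIN provides.

```python
import os


def require_api_key(var_name="OPENAI_API_KEY", environ=os.environ):
    """Return the key, or raise a clear error before any pipeline work starts."""
    key = environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it before running `aidrin agentic`")
    return key


# Usage: call require_api_key() once at the top of your script.
```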
If you are using a different OpenAI-compatible endpoint (e.g. LBL CBORG), update llm.base_url
in the config and ensure that every model and embedding_model value names a model available on that
provider:
# examples/agentic/power_consumption/config.yaml
paths:
  data_loader: "./loader.py:load_dataset"
  metadata_csv: "./data/metadata.txt"
llm:
  base_url: "https://api.openai.com/v1"  # replace with your OpenAI-compatible endpoint, e.g. https://api.cborg.lbl.gov
profiling:
  full_summary: false
vector_store:
  sources:
    - ./sources
  embedding_model: text-embedding-ada-002
  chunk_size: 1000
  chunk_overlap: 200
  vector_store_name: power_consumption_index
retrieval:
  enabled: true
  max_workers: 4
  answer_model: gpt-4o
  top_k: 3
  question:
    - "Which European Union regulation is cited as requiring that the consequences of profiling be informed to the data subject? Return the name of the regulation as a string."
    - "Return True if more than 80% of the data is resampled to align with the widely adopted industry standards for smart meter technology to reduce behavioral noise. Return False if not."
executor:
  enabled: true
  max_attempts: 5
  model: gpt-4o
  temperature: 0.0
complexity_scorer:
  enabled: true
  model: gpt-4o
remediation:
  enabled: true
  model: gpt-4o
  context_chars: 3000
output:
  save_log: true
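To get a feel for how chunk_size and chunk_overlap interact, the sketch below splits text into overlapping character windows. It assumes both settings are measured in characters and that splitting is a plain sliding window; the splitter AIDRIN actually uses may count differently or break on separators, and chunk_text is a hypothetical name.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = chunk_text("x" * 2500)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

A larger overlap reduces the chance that a relevant sentence is cut in half at a chunk boundary, at the cost of more chunks to embed.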
Step 4: Build the vector index
Run this once. Re-run only when your literature changes:
aidrin agentic build-index -c examples/agentic/power_consumption/config.yaml
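Conceptually, building the index embeds every passage once so that later questions can be matched against it by similarity, with top_k controlling how many passages come back. The toy sketch below illustrates that idea only: a bag-of-words counter replaces the real embedding model, a sorted list replaces FAISS, and the passages are made up for the example.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a real index would call the embedding model instead."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(passages):
    # "Building" here just means embedding every passage once, up front
    return [(passage, embed(passage)) for passage in passages]


def retrieve(index, question, top_k=3):
    query = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [passage for passage, _ in ranked[:top_k]]


index = build_index([
    "smart meters report household power readings every minute",
    "missing readings should be imputed before resampling",
    "the garden was watered on tuesday",
])
print(retrieve(index, "how are missing power readings handled", top_k=2))
```

Because embedding is the expensive step, doing it once at build time is what makes it reasonable to skip the rebuild on later runs.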
Step 5: Run the pipeline
aidrin agentic run \
-c examples/agentic/power_consumption/config.yaml \
-o examples/agentic/power_consumption/results.json
On subsequent runs, skip rebuilding the index with --skip-vector:
aidrin agentic run \
-c examples/agentic/power_consumption/config.yaml \
--skip-vector \
-o examples/agentic/power_consumption/results.json
Results are printed to stdout and written to examples/agentic/power_consumption/results.json. Each result includes the question,
the answer, the retrieved passages that informed it, the generated code (if applicable), a
complexity classification, and remediation recommendations.
Using Your Own Dataset
Follow the same steps, substituting your own dataset and literature.
Step 1: Set up your project directory
~/my_project/
├── config.yaml
├── loader.py
├── data/
│   ├── my_data.csv       # your dataset
│   └── metadata.txt      # column-level descriptions (plain text or CSV)
└── sources/              # domain literature to index (PDF or TXT)
    ├── reference.pdf
    └── standards.txt
Step 2: Define your custom data loader
Implement loader.py in your project directory with a function load_dataset that returns your dataset as a pandas.DataFrame.
This gives you full control over the loading logic and lets you support any file format:
# ~/my_project/loader.py
from pathlib import Path
import pandas as pd

def load_dataset() -> pd.DataFrame:
    # Resolve the CSV relative to this file so the project can live anywhere on disk
    return pd.read_csv(Path(__file__).parent / "data" / "my_data.csv")
Then reference it in config.yaml via paths.data_loader (see Step 4). For datasets that
require more complex loading — multiple files, Parquet, HDF5, etc. — replace the body of
load_dataset with whatever logic you need; the only requirement is that it returns a
pandas.DataFrame.
Step 3: Add domain literature
Place your domain literature (PDFs, text files) in sources/ and your dataset in data/.
Step 4: Configure your API key and write a config file
Set the OPENAI_API_KEY environment variable to the key your endpoint requires:
export OPENAI_API_KEY="sk-..."
Then create config.yaml. If you are using a different OpenAI-compatible endpoint (e.g. LBL CBORG), update
llm.base_url to point to it and ensure that every model and embedding_model value names a model
available on that provider:
# ~/my_project/config.yaml
llm:
  base_url: "https://api.openai.com/v1"  # replace with your OpenAI-compatible endpoint, e.g. https://api.cborg.lbl.gov
paths:
  data_loader: "./loader.py:load_dataset"  # module:function relative to config dir
  metadata_csv: "./data/metadata.txt"
profiling:
  full_summary: false
vector_store:
  sources:
    - ./sources
  embedding_model: text-embedding-ada-002
  chunk_size: 1000
  chunk_overlap: 200
  vector_store_name: my_project_index  # name for the FAISS index that will be created
retrieval:
  enabled: true
  max_workers: 8
  answer_model: gpt-5.2
  top_k: 3
  question:
    - "Does the age feature satisfy the HIPAA Safe Harbor de-identification standard?"
executor:
  enabled: true
  max_attempts: 5
  model: gpt-5.2
  temperature: 0.0
complexity_scorer:
  enabled: true
  model: gpt-5.2
remediation:
  enabled: true
  model: gpt-5.2
  context_chars: 3000
Step 5: Build the index and run
aidrin agentic build-index -c ~/my_project/config.yaml
aidrin agentic run -c ~/my_project/config.yaml -o ~/my_project/results.json