.. _usage: .. _web_usage: Web Application Usage ===================== Usage via PyPI -------------- This section describes how to use **AIDRIN** after installing it via TestPyPI. The PyPI package enables data readiness and privacy analysis functions in Python notebooks or scripts, offering a simpler setup compared to cloning the GitHub repository. Note that this method may not include the latest changes available in the repository. Installation ~~~~~~~~~~~~ Install AIDRIN from TestPyPI with: .. code-block:: bash pip install -i https://test.pypi.org/simple/ aidrin== Replace ```` with the latest version (e.g., ``0.9.7``) from the `TestPyPI page `_. Verify the installation: .. code-block:: python import aidrin print(aidrin.__version__) This displays the installed version (e.g., ``0.9.7``). Using AIDRIN Functions ~~~~~~~~~~~~~~~~~~~~~~ AIDRIN provides functions for data readiness and privacy analysis on datasets (e.g., CSV files). Below, we outline the key functions, their purpose, and what they return, using a sample dataset (``adult.csv``) as an example. You can download this dataset from the `UCI Machine Learning Repository `_. You can find sample datasets in the `examples/sample_data` directory of the repository, or download them directly from the web interface's **Sample Data** panel on the inspector page. Setting Up File Information ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Most functions require a ``file_info`` tuple with the file path, name, and type: .. code-block:: python file_path = "path/to/adult.csv" file_name = "adult.csv" file_type = ".csv" file_info = (file_path, file_name, file_type) Available Functions ~~~~~~~~~~~~~~~~~~~ Below are AIDRIN's primary functions, their usage, and the type of output they return. calculate_completeness ^^^^^^^^^^^^^^^^^^^^^^ Evaluates dataset completeness by checking for missing values. **Usage**: .. code-block:: python from aidrin import calculate_completeness result = calculate_completeness(file_info) **Returns**: A dictionary with an overall completeness score (1 for no missing values, 0 for all missing) and a histogram of missing value proportions per column. .. note:: **HDF5 fill value handling**: HDF5 datasets encode missing data as a numeric sentinel (the *fill value*) rather than as a blank cell. When reading an ``.h5`` file AIDRIN automatically translates these sentinels to ``NaN`` before computing completeness, so the score reflects true data availability rather than always reporting 100%. Sentinels are collected from the following sources, in priority order: 1. **User-supplied** – pass ``fill_values=[v1, v2, …]`` to ``hdf5Reader`` at construction time to declare domain-specific sentinels explicitly. 2. **_FillValue attribute** – the NetCDF/CF convention used by virtually all climate, oceanography, and atmospheric HDF5 files. 3. **missing_value attribute** – the older NetCDF convention; may be a scalar or an array of multiple sentinels. 4. **HDF5 native fill value** – the value stored in the dataset's own metadata (``dataset.fillvalue``). When this equals the dtype default (``0`` / ``0.0``) and no fill-value attributes are present, a warning is logged before replacement because zero is a legitimate measurement in many scientific datasets (e.g. counts, indices). Set a ``_FillValue`` attribute in the file to an unambiguous sentinel to suppress this warning. calculate_correlations ^^^^^^^^^^^^^^^^^^^^^^ Calculates correlations between specified columns (numerical or categorical). You can specify the columns interested in analysis using the `columns` parameter. **Usage**: .. code-block:: python from aidrin import calculate_correlations result = calculate_correlations(columns=['age', 'education.num'], file_info=file_info) **Returns**: A dictionary with numerical correlation scores (automatically choosing Pearson's or Spearman's coefficient based on a normality check) and categorical correlation analysis using Theil's U statistic. It will also return a visualization (heatmap) of the correlations, and the selected numerical method is exposed under ``Correlations Analysis Numerical -> Method``. calculate_class_distribution ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Analyzes class distribution in a specified column to quantify imbalance. The `column` parameter specifies the target column for analysis. It uses imbalance degree scoring to assess class balance. It measures the Euclidean distance between the actual class distribution and a perfectly balanced distribution. **Usage**: .. code-block:: python from aidrin import calculate_class_distribution result = calculate_class_distribution(column='income', file_info=file_info) **Returns**: A dictionary with an imbalance degree score and a pie chart visualization of the class distribution. calculate_duplicates ^^^^^^^^^^^^^^^^^^^^ Detects duplicate rows in the dataset. **Usage**: .. code-block:: python from aidrin import calculate_duplicates result = calculate_duplicates(file_info=file_info) **Returns**: A dictionary with the proportion of duplicate rows (0 for no duplicates). calculate_feature_relevance ^^^^^^^^^^^^^^^^^^^^^^^^^^^ Assesses feature relevance relative to a given target column. Categorical features are encoded using one-hot encoding, and numerical features are used as-is. Then the Pearson correlation coefficient is calculated between each feature and the target column. **Usage**: .. code-block:: python from aidrin import calculate_feature_relevance result = calculate_feature_relevance(file_info=file_info, target_col='income') **Returns**: A dictionary with feature importance scores for the target column. A bar chart visualization of feature importances is also provided. calculate_outliers ^^^^^^^^^^^^^^^^^^ Identifies outliers in numerical columns using the Interquartile Range (IQR) method. This method calculates the first (Q1) and third (Q3) quartiles, computes the IQR (Q3 - Q1), and defines outliers as values below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`. The proportion of outliers is calculated for each numerical column, and an overall outlier score is derived by averaging the individual column scores. This is calculated for each numerical column. **Usage**: .. code-block:: python from aidrin import calculate_outliers result = calculate_outliers(file_info=file_info) **Returns**: A dictionary with outlier scores for each numerical column and an overall score. A bar chart visualization of outlier scores is also provided. calculate_statistical_rates ^^^^^^^^^^^^^^^^^^^^^^^^^^^ Computes statistical rates (e.g., proportions) for groups across classes. The `sensitive_attribute_column` parameter specifies the sensitive attribute for analysis, while the `y_true_column` parameter indicates class labels. **Usage**: .. code-block:: python from aidrin import calculate_statistical_rates result = calculate_statistical_rates(sensitive_attribute_column='sex', y_true_column='income', file_info=file_info) **Returns**: A dictionary with group proportions, and a visualization (bar chart) of the proportions subdivided by class labels. compute_k_anonymity ^^^^^^^^^^^^^^^^^^^ Measures k-anonymity for specified quasi-identifier columns. It calculates the minimum k value across all equivalence classes formed by the quasi-identifiers. The risk score is derived from the minimum k value, where a higher k indicates lower re-identification risk. **Usage**: .. code-block:: python from aidrin import compute_k_anonymity result = compute_k_anonymity(quasi_identifiers=['sex', 'race'], file_info=file_info) **Returns**: A dictionary with the minimum k-anonymity value, risk score, descriptive statistics, histogram data, and a visualization (histogram). compute_l_diversity ^^^^^^^^^^^^^^^^^^^ Quantifies l-diversity for a sensitive attribute within groups defined by quasi-identifiers. It measures the diversity of sensitive attribute values in each group, with a higher l-diversity indicating better protection against attribute disclosure. **Usage**: .. code-block:: python from aidrin import compute_l_diversity result = compute_l_diversity(quasi_identifiers=['sex'], sensitive_column='race', file_info=file_info) **Returns**: A dictionary with the l-diversity value, risk score, descriptive statistics, histogram data, and a visualization (histogram). compute_t_closeness ^^^^^^^^^^^^^^^^^^^ Measures t-closeness for a sensitive attribute relative to its overall distribution. It quantifies the similarity between the distribution of a sensitive attribute in a group and its distribution in the overall dataset. A lower t-closeness value indicates better protection against attribute disclosure. **Usage**: .. code-block:: python from aidrin import compute_t_closeness result = compute_t_closeness(quasi_identifiers=['sex'], sensitive_column='sex', file_info=file_info) **Returns**: A dictionary with the t-closeness value, risk score, descriptive statistics, histogram data, and a visualization (histogram). compute_entropy_risk ^^^^^^^^^^^^^^^^^^^^ Calculates entropy risk for quasi-identifier columns. It measures the uncertainty in identifying individuals based on the quasi-identifiers. A higher entropy value indicates greater uncertainty and lower re-identification risk. **Usage**: .. code-block:: python from aidrin import compute_entropy_risk result = compute_entropy_risk(quasi_identifiers=['sex'], file_info=file_info) **Returns**: A dictionary with the entropy risk value, risk score, descriptive statistics, histogram data, and a visualization (bar chart). Local and Web Application Usage -------------------------------- AIDRIN can be used as a web application at `aidrin.io `_ or installed locally (see `Web Application Installation <./web_installation.html>`_). Both share the same codebase, but the web application is hosted on a server, eliminating the need to manage dependencies or background services like Redis, Celery, or Flask. The web interface provides a user-friendly way to evaluate datasets across six dimensions of data readiness for AI: **Data Quality**, **Impact of Data on AI**, **Fairness and Bias**, **Data Governance**, **Understandability and Usability**, and **Data Structure and Organization**. Each dimension includes specific metrics to assess dataset readiness. Web Application Workflow ~~~~~~~~~~~~~~~~~~~~~~~~ To use the AIDRIN web application: 1. **Upload a Data File**: - Navigate to the file upload page at `aidrin.io `_. or `https://127.0.0.1:5000/upload-file` if running locally. - Upload a dataset (e.g., CSV file like ``adult.csv``) via the web interface. You can download the sample dataset from the `UCI Machine Learning Repository `_. - The file is processed server-side. 2. **Select a Data Readiness Dimension**: - From the homepage, choose one of the six dimensions to evaluate. - Each dimension offers specific metrics, detailed below. 3. **Choose Metrics and Configure Parameters**: - Select the desired metrics for the chosen dimension. - Specify any required parameters (e.g., column names for analysis). - AIDRIN processes the dataset and generates results. 4. **View Results and Download Report**: - Results include downloadable data summary statistics and visualizations (e.g., histograms, bar charts, heatmaps). - A JSON report summarizing the results is available for download. - Return to the homepage to select another dimension or upload a new dataset. 5. **AI Explanations (Optional)**: - If the ``openai`` package is installed (``pip install aidrin[llm]``), a sparkle icon appears in the top-right toolbar. - Click it to configure an OpenAI-compatible API endpoint (OpenAI, Ollama, vLLM, etc.). - Once configured, each metric result will include an AI-generated explanation summarizing the key observations and implications for AI/ML. - The model name is shown in each explanation for transparency. Data Readiness Dimensions and Metrics ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Below are the six dimensions, their associated metrics, the methods used, and the outputs generated. Data Quality ^^^^^^^^^^^^ Evaluates the quality of the dataset through metrics that assess completeness, duplicates, and outliers. - **Completeness**: - **Method**: Calculates the proportion of non-missing values in the dataset. The overall completeness score is the average proportion of non-missing values across all columns. - **Parameters**: None (uses entire dataset). - **Result**: A chart with values ranging from 0 (all values missing) to 1 (no missing values) for each column in the dataset, and an overall completeness score. - **Duplicates**: - **Method**: Identifies duplicate rows by comparing all column values. The duplicity score is the proportion of duplicate rows in the dataset. - **Parameters**: None (uses entire dataset). - **Result**: A duplicity score (0 for no duplicates). - **Outliers**: - **Method**: Uses the Interquartile Range (IQR) method, calculating Q1 (first quartile), Q3 (third quartile), and IQR (Q3 - Q1). Outliers are values below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`. The outlier score is the proportion of outliers per numerical column, with an overall score averaged across columns. - **Parameters**: None (applies to all numerical columns). - **Result**: Bar chart of outlier scores per numerical column and an overall outlier score. Impact of Data on AI ^^^^^^^^^^^^^^^^^^^^ Assesses how dataset features influence AI through correlation and feature relevance analysis. - **Correlation Analysis**: - **Method**: For numerical columns, runs a normality check using the Shapiro–Wilk test (α = 0.05) on up to 5000 sampled rows; if the test does not reject normality it computes Pearson's correlation coefficient; otherwise it uses Spearman's rank correlation (both ranging from -1 to 1). When SciPy is unavailable, a skewness/kurtosis heuristic is used as a fallback. For categorical columns, it uses Theil's U statistic to measure association. - **Parameters**: Select columns for analysis (numerical and/or categorical). - **Result**: Heatmap visualization of correlation coefficients. - **Feature Relevance**: - **Method**: Encodes categorical features using one-hot encoding and uses numerical features as-is. Computes the Pearson correlation coefficient between each feature and the target column. - **Parameters**: Select a target column (e.g., `'income'`) and features to analyze. - **Result**: Bar chart of feature importance scores relative to the target column. Fairness and Bias ^^^^^^^^^^^^^^^^^ Evaluates potential biases in the dataset, particularly for classification tasks, through class imbalance and demographic metrics. - **Class Imbalance**: - **Method**: Measures the distance between the actual class distribution and a perfectly balanced distribution using an imbalance degree score. You can select the distance metric from the provided options (e.g., Euclidean distance). Also you will have to specify the target column for analysis. - **Parameters**: Target column name (e.g., `'income'`). Distance metric (e.g., `'euclidean'`). - **Result**: Pie chart of class distribution. JSON report with imbalance degree score. - **Representation Rates**: - **Method**: Calculates the proportion of each group (defined by a sensitive attribute) in the dataset. - **Parameters**: Sensitive attribute column (e.g., `'sex'`). - **Result**: Bar chart of representation rates. - **Statistical Rates**: - **Method**: Computes proportions of groups (defined by a sensitive attribute) across class labels. - **Parameters**: Sensitive attribute column (e.g., `'sex'`) and class label column (e.g., `'income'`). - **Result**: Bar chart of proportions subdivided by class labels. - **Conditional Demographic Disparity**: - **Method**: Measures disparity in outcomes across demographic groups, conditioned on other variables, to identify potential bias. - **Parameters**: Sensitive attribute column and class label column. - **Result**: Bar chart of disparity scores. Data Governance ^^^^^^^^^^^^^^^ Focuses on privacy preservation through metrics that assess anonymity and disclosure risk. - **k-Anonymity**: - **Method**: Calculates the minimum group size (k) sharing the same quasi-identifier values. A higher k indicates lower re-identification risk. - **Parameters**: List of quasi-identifier columns (e.g., `['sex', 'race']`). - **Result**: Histogram of equivalence class sizes. - **l-Diversity**: - **Method**: Quantifies the diversity of sensitive attribute values within groups defined by quasi-identifiers. A higher l value indicates better protection against attribute disclosure. - **Parameters**: Quasi-identifier columns (e.g., `['sex']`) and sensitive column (e.g., `'race'`). - **Result**: Histogram of l-diversity values. - **t-Closeness**: - **Method**: Measures the distance between the distribution of a sensitive attribute in a group and the overall dataset distribution. A lower t value indicates better privacy. - **Parameters**: Quasi-identifier columns (e.g., `['sex']`) and sensitive column (e.g., `'sex'`). - **Result**: Histogram of t-closeness values. - **Entropy Risk**: - **Method**: Measures the uncertainty in identifying individuals based on quasi-identifiers. A higher entropy value indicates lower re-identification risk. - **Parameters**: Quasi-identifier columns (e.g., `['sex']`). - **Result**: Bar chart of entropy values. - **HIPAA Compliance**: - **Method**: Scans datasets for the presence of 8 out of 18 key HIPAA-regulated identifiers as defined under the `Safe Harbor method `_. This includes detection of direct and indirect identifiers that could enable re-identification of individuals. - **Identifiers Detected**: Social Security Numbers (SSNs), email addresses, phone and fax numbers, IP addresses, URLs, Vehicle Identification Numbers (VINs), and medical or account identifiers. Additionally, US postal codes are identified using geographic validation powered by `pgeocode `_. - **Parameters**: Configuration of columns to scan or exclude. - **Result**: Flagged records with detected identifiers, including counts and classification by identifier type. Understandability and Usability ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This feature evaluates dataset metadata for compliance with the **FAIR principles** — *Findable*, *Accessible*, *Interoperable*, and *Reusable*. It ensures your dataset is well-documented, discoverable, and reusable by others. FAIR Compliance Report ''''''''''''''''''''''' The **FAIR Compliance Report** analyzes your dataset's metadata file (in **DCAT** or **DataCite JSON** format) and provides a detailed assessment against the FAIR criteria. How it Works '''''''''''''' 1. Navigate to the `FAIR Compliance Report upload page `_. 2. Upload your metadata file (**DCAT** or **DataCite JSON**). 3. The system evaluates the file against the FAIR principles and generates a structured report. FAIR Principles and Criteria ''''''''''''''''''''''''''''' The evaluation checks for the presence and quality of specific metadata elements grouped under each FAIR principle: **Findable** - ``identifier`` - ``title`` - ``description`` - ``keyword`` - ``theme`` - ``landingPage`` **Accessible** - ``accessLevel`` - ``downloadURL`` - ``mediaType`` - ``accessURL`` - ``issued`` - ``modified`` **Interoperable** - ``conformsTo`` - ``references`` - ``language`` - ``format`` - ``spatial`` - ``temporal`` **Reusable** - ``license`` - ``rights`` - ``publisher`` - ``description`` - ``format`` - ``programCode`` - ``bureauCode`` - ``contactPoint`` Output ~~~~~~ The system returns: - **FAIR compliance scores** for each principle with visualizations. - A breakdown of present and missing metadata elements. .. note:: AIDRIN focuses on the completeness and structure of your metadata. It does **not** validate the factual accuracy of the content. Data Structure and Organization ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Currently, no specific metrics are implemented for this dimension. Future updates may include metrics for assessing dataset schema, format consistency, or organization. Notes ~~~~~ - **Local vs. Web Application**: - The local installation requires setting up Redis, Celery, and Flask (see `Web Application Installation <./web_installation.html>`_). The web application at `aidrin.io `_ handles these server-side, offering a no-setup alternative. - Both use the same codebase, ensuring identical functionality. The web application is ideal for users who prefer a browser-based interface. - **File Formats**: The web application supports CSV, Excel, JSON, NumPy (``.npz``), and HDF5 (``.h5``) files for data uploads, and DCAT/DataCite JSON for metadata in the Understandability and Usability dimension. For HDF5 files, fill-value sentinels (``_FillValue``, ``missing_value``, and the HDF5 native fill value) are automatically converted to ``NaN`` so that all metrics — completeness, outliers, feature relevance, and privacy — operate on accurately marked missing data. See the ``calculate_completeness`` note above for the full sentinel-resolution order. - **Visualizations**: Generated downloadable plots (e.g., histograms, bar charts, heatmaps) are displayed in the web interface. - **JSON Reports**: Each dimension's analysis generates a downloadable JSON report containing all metrics, statistics, and visualization data (where applicable). ---- Custom Metrics and Remedies ---------------------------- This section explains how to define custom metrics and remediation logic for your uploaded CSV files using the **CustomDR** class inside the CodeMirror editor. After uploading a dataset, you will navigate to a page where you can write Python code that extends the platform's data-review logic. Workflow ~~~~~~~~ 1. Navigate to the file upload page and upload a CSV file. 2. After upload, click the **Define Custom Metrics** button. You will be redirected to ``/customMetrics``. 3. A CodeMirror Python editor appears, preloaded with an editable ``CustomDR`` class that inherits from ``BaseDRAgent`` and contains two methods: - ``metric()``: returns a dictionary of metric results. - ``remedy()``: returns a modified dataset based on your remediation logic. 4. Write or modify your code inside the editor. 5. Press **Save** to store your custom logic on the server (temporary, 1-hour expiration). 6. Press **Submit** to execute your ``metric()`` function, and optionally your ``remedy()`` function if you have checked the **Apply Remedy** box. The platform will display your computed metric dictionary, any remediated dataset to download (if remedy is enabled), and any warnings or errors raised by your code. Understanding the ``CustomDR`` Base Class ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Below is the template initially shown in CodeMirror: .. code-block:: python from aidrin.custom_metrics.base_dr import BaseDRAgent from typing import Any from typing import Dict, Union, Any class CustomDR(BaseDRAgent): def __init__(self, dataset: Any, **kwargs): super().__init__(dataset, **kwargs) def metric(self, **kwargs): """ Implement your custom metric logic here. """ # IMPLEMENT YOUR METRIC LOGIC BELOW # Example: Calculating the total number of missing cells in the entire DataFrame # df: pd.DataFrame = self.dataset # return { # "total_missing_cells": df.isna().sum().to_dict() # } return {"message": "Placeholder metric. Implement your logic here."} def remedy(self, metric_results: dict): """ Applies custom remediation logic based on the calculated metrics. """ # IMPLEMENT YOUR REMEDIATION LOGIC BELOW # For example, filling null values with a default value # df_remedied: pd.DataFrame = self.dataset.copy() # df_remedied.fillna(0, inplace=True) # return df_remedied return self.dataset The goal is to replace the placeholder logic with your own custom metric and remediation steps. Implementing ``metric()``: Requirements and Tips ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Your ``metric()`` method: - Must return a **dictionary** whose keys are metric names and values are computed results. - Receives the dataset through ``self.dataset``. - May accept additional keyword arguments (depending on future UI extensions). - Should not mutate the dataset; all transformations belong in ``remedy()``. Example: Compute missing values, row count, and column datatypes. .. code-block:: python def metric(self, **kwargs): df = self.dataset return { "row_count": len(df), "column_count": df.shape[1], "missing_values": df.isna().sum().to_dict(), "dtypes": df.dtypes.apply(lambda x: str(x)).to_dict(), } Implementing ``remedy()``: Requirements and Tips ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The ``remedy()`` method receives the ``metric_results`` dictionary returned by ``metric()``. Use this method when you want to apply data-cleaning or transformation logic based on your computed metrics. Or, you can modify the dataset directly without relying on ``metric_results``. You must return the updated dataset at the end of ``remedy()``. Full Practical Example: A Combined Metric and Remedy Class ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The example below shows both ``metric()`` and ``remedy()`` implemented in a realistic workflow. .. code-block:: python from aidrin.custom_metrics.base_dr import BaseDRAgent import pandas as pd from typing import Dict, Union, Any class CustomDR(BaseDRAgent): """ An agent focused on detecting and removing duplicate rows. """ def __init__(self, dataset: Any, **kwargs): super().__init__(dataset, **kwargs) def metric(self, **kwargs) -> Dict[str, int]: """ Calculates the total count of duplicate rows. """ df: pd.DataFrame = self.dataset duplicate_rows_count: int = df.duplicated().sum() return { "duplicate_rows_total": duplicate_rows_count, } def remedy(self, metric_results: Dict[str, Any]) -> pd.DataFrame: """ Removes duplicate rows using the calculated metric results. """ # Create a copy for safe modification to prevent side effects on the original state. df_remedied: pd.DataFrame = self.dataset.copy() duplicate_count = metric_results.get("duplicate_rows_total", 0) if duplicate_count > 0: df_remedied.drop_duplicates(inplace=True) return df_remedied How the System Uses Your CustomDR Class ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When you click **Submit**: 1. Your code is dynamically loaded and executed in an isolated environment. 2. The system creates an instance of your ``CustomDR`` class. 3. The system calls your ``metric()`` method to compute metrics. 4. If **Apply Remedy** is checked, the system calls your ``remedy()`` method to get the modified dataset. 5. Metrics and (optionally) the remedied data preview are displayed on the results section of the page. Best Practices for Writing Custom Metrics ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - **Do not mutate the dataset inside ``metric()``.** All modifications belong in ``remedy()``. - Work on a **copy** of ``self.dataset`` in ``remedy()`` to avoid side effects. - Always return the modified dataset at the end of ``remedy()``. Data and Code Storage Rules ~~~~~~~~~~~~~~~~~~~~~~~~~~~ - Your custom metric code is stored temporarily for **1 hour**. - The (optional) remedied dataset is also stored for **1 hour**. - After expiration, all artifacts are safely removed.