delta-audit

Benchmarks

This page describes the benchmark experiments and how to reproduce the results from the paper.

Overview

The Delta-Audit benchmark consists of 45 experiments (5 algorithms × 3 datasets × 3 model pairs):

5 algorithms: Logistic Regression, SVM, Random Forest, Gradient Boosting, KNN
3 datasets: Breast Cancer, Wine, Digits
3 pairs per algorithm: each pair compares a Model A and a Model B that differ in their hyperparameter configuration
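The experiment count follows from taking every combination of algorithm, dataset, and pair. A minimal sketch of that enumeration is shown here; the algorithm short names and pair names are the ones used on this page, while the dataset key `breast_cancer` is an assumed spelling (only `wine` and `digits` appear in the sample configuration).

```python
from itertools import product

# Short names as used in the section headings and pair tables on this page.
ALGORITHMS = ["logreg", "svc", "rf", "gb", "knn"]
# "breast_cancer" is an assumed key; "wine" and "digits" appear in the sample config.
DATASETS = ["breast_cancer", "wine", "digits"]
PAIRS = ["pair1", "pair2", "pair3"]

# One experiment per (algorithm, dataset, pair) combination.
experiments = list(product(ALGORITHMS, DATASETS, PAIRS))

print(len(experiments))  # 5 * 3 * 3 = 45
```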

Experimental Setup

Datasets

Dataset	Type	Samples	Features	Classes
Breast Cancer	Binary	569	30	2
Wine	Multi-class	178	13	3
Digits	Multi-class	1797	64	10

Algorithms and Pairs

Logistic Regression (logreg)

Pair	Model A	Model B	Description
pair1	C=1.0, penalty=l2, solver=lbfgs	C=0.1, penalty=l2, solver=lbfgs	Regularization strength
pair2	C=1.0, penalty=l2, solver=liblinear	C=1.0, penalty=l1, solver=liblinear	Penalty type
pair3	C=1.0, penalty=l2, solver=lbfgs	C=1.0, penalty=l2, solver=saga	Solver algorithm

Support Vector Classification (svc)

Pair	Model A	Model B	Description
pair1	C=1.0, kernel=rbf, gamma=scale	C=1.0, kernel=linear, gamma=scale	Kernel type
pair2	C=1.0, kernel=rbf, gamma=scale	C=1.0, kernel=rbf, gamma=auto	Gamma parameter
pair3	C=1.0, kernel=poly, degree=3, gamma=scale	C=1.0, kernel=rbf, gamma=scale	Kernel complexity

Random Forest (rf)

Pair	Model A	Model B	Description
pair1	n_estimators=100, max_depth=None, max_features=None	n_estimators=300, max_depth=None, max_features=None	Ensemble size
pair2	n_estimators=200, max_depth=None, max_features=None	n_estimators=200, max_depth=5, max_features=None	Tree depth
pair3	n_estimators=200, max_depth=None, max_features=sqrt	n_estimators=200, max_depth=None, max_features=log2	Feature selection

Gradient Boosting (gb)

Pair	Model A	Model B	Description
pair1	n_estimators=150, learning_rate=0.1, max_depth=3	n_estimators=150, learning_rate=0.05, max_depth=3	Learning rate
pair2	n_estimators=100, learning_rate=0.1, max_depth=3	n_estimators=200, learning_rate=0.1, max_depth=3	Ensemble size
pair3	n_estimators=150, learning_rate=0.1, max_depth=3	n_estimators=150, learning_rate=0.1, max_depth=5	Tree depth

K-Nearest Neighbors (knn)

Pair	Model A	Model B	Description
pair1	n_neighbors=5, weights=uniform, algorithm=auto	n_neighbors=10, weights=uniform, algorithm=auto	Number of neighbors
pair2	n_neighbors=5, weights=uniform, algorithm=auto	n_neighbors=5, weights=distance, algorithm=auto	Weighting scheme
pair3	n_neighbors=5, weights=uniform, algorithm=auto	n_neighbors=5, weights=uniform, algorithm=ball_tree	Algorithm type
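To see at a glance what each pair actually varies, the two configurations can be compared key by key. The helper below is a small illustration, not part of Delta-Audit; `changed_params` is a hypothetical name, and the hyperparameter values are copied from the SVC table above.

```python
def changed_params(model_a: dict, model_b: dict) -> dict:
    """Return {name: (value_in_A, value_in_B)} for every hyperparameter that differs.

    A parameter present in only one model is reported with None on the other side.
    """
    keys = sorted(set(model_a) | set(model_b))
    return {
        k: (model_a.get(k), model_b.get(k))
        for k in keys
        if model_a.get(k) != model_b.get(k)
    }

# SVC pair1 and pair3, as listed in the table above.
svc_pair1 = (
    {"C": 1.0, "kernel": "rbf", "gamma": "scale"},
    {"C": 1.0, "kernel": "linear", "gamma": "scale"},
)
svc_pair3 = (
    {"C": 1.0, "kernel": "poly", "degree": 3, "gamma": "scale"},
    {"C": 1.0, "kernel": "rbf", "gamma": "scale"},
)

print(changed_params(*svc_pair1))  # {'kernel': ('rbf', 'linear')}
print(changed_params(*svc_pair3))  # {'degree': (3, None), 'kernel': ('poly', 'rbf')}
```

Most pairs change a single hyperparameter; SVC pair3 is an example where the kernel change also drops the `degree` setting.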

Reproducing Results

Prerequisites

# Install Delta-Audit
pip install -e .
pip install -r requirements.txt

Run Full Benchmark

# Run all 45 experiments
delta-audit run --config configs/full_benchmark.yaml

This will:

Train model pairs across all algorithms and datasets
Compute Δ-Attribution metrics
Save results to results/delta_summary.csv
Save standard metrics to results/standard_summary.csv

Generate Figures

# Generate all figures from results
delta-audit figures --summary results/_summary --out results/figures/

Check Results

# Run sanity checks
delta-audit check

Expected Results

Key Statistics

Metric	Expected Range	Description
BAC	0.1 - 0.8	Behavioral Alignment Coefficient
DCE	0.01 - 0.5	Differential Conservation Error
Δ Magnitude L1	0.1 - 2.0	Attribution change magnitude
Rank Overlap @10	0.2 - 0.8	Overlap between the top-10 ranked features of Model A and Model B
JSD	0.01 - 0.3	Jensen-Shannon divergence (distributional shift)
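Three of the metrics in the table correspond to standard ways of comparing two attribution vectors. As a rough illustration only, the sketch below gives textbook versions of the L1 magnitude of the attribution change, a top-k rank overlap, and the Jensen-Shannon divergence in plain Python. The function names, the normalization by absolute value, and the base-2 logarithm are assumptions made for this sketch; the definitions Delta-Audit actually uses are in src/delta_audit/metrics.py and may differ. BAC and DCE are not sketched because their formulas are not given on this page.

```python
import math

def delta_magnitude_l1(attr_a, attr_b):
    """L1 norm of the attribution change between Model A and Model B."""
    return sum(abs(b - a) for a, b in zip(attr_a, attr_b))

def top_k(attr, k):
    """Indices of the k features with the largest absolute attribution."""
    order = sorted(range(len(attr)), key=lambda i: abs(attr[i]), reverse=True)
    return set(order[:k])

def rank_overlap_at_k(attr_a, attr_b, k=10):
    """Fraction of the top-k features shared by both attribution vectors."""
    k = min(k, len(attr_a), len(attr_b))
    if k == 0:
        return 0.0
    return len(top_k(attr_a, k) & top_k(attr_b, k)) / k

def normalize_abs(attr):
    """Turn attributions into a probability vector via absolute values (assumed scheme)."""
    total = sum(abs(a) for a in attr)
    if total == 0:
        return [1.0 / len(attr)] * len(attr)
    return [abs(a) / total for a in attr]

def jsd(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing; b_i > 0 whenever a_i > 0.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Made-up attribution vectors for four features, purely for illustration.
attr_a = [0.5, -0.3, 0.1, 0.0]
attr_b = [0.4, 0.1, -0.3, 0.0]

print(delta_magnitude_l1(attr_a, attr_b))
print(rank_overlap_at_k(attr_a, attr_b, k=2))  # 0.5: only feature 0 is in both top-2 sets
print(jsd(normalize_abs(attr_a), normalize_abs(attr_b)))
```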

Performance Impact

Improvements: ~40-60% of experiments show accuracy improvement
Degradations: ~20-40% of experiments show accuracy degradation
No change: ~10-20% of experiments show no accuracy change
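The three categories above can be tallied from per-experiment accuracies of Model A and Model B. The sketch below shows one way to do that; the function name, the tolerance used to decide "no change", and the accuracy values are all made up for illustration and do not come from the benchmark results.

```python
from collections import Counter

def classify_change(acc_a, acc_b, tol=1e-9):
    """Label the accuracy change from Model A to Model B (tolerance is an assumption)."""
    delta = acc_b - acc_a
    if delta > tol:
        return "improvement"
    if delta < -tol:
        return "degradation"
    return "no change"

# Made-up (accuracy_A, accuracy_B) values, not benchmark output.
accuracy_pairs = [(0.95, 0.97), (0.90, 0.88), (0.93, 0.93), (0.91, 0.94)]

counts = Counter(classify_change(a, b) for a, b in accuracy_pairs)
shares = {label: n / len(accuracy_pairs) for label, n in counts.items()}

print(shares)  # share of experiments in each category
```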

Algorithm Rankings

Best BAC (typically, best first): Logistic Regression, Random Forest, Gradient Boosting, SVM, KNN

Lowest DCE (typically, lowest first): Logistic Regression, Random Forest, Gradient Boosting, SVM, KNN

Computational Requirements

Time Requirements

Component	Time (CPU)	Time (GPU)
Quickstart	2-5 minutes	1-2 minutes
Full benchmark	10-30 minutes	5-15 minutes
Figure generation	1-2 minutes	1-2 minutes

Memory Requirements

Quickstart: ~500MB RAM
Full benchmark: ~2GB RAM
Figure generation: ~1GB RAM

Hardware Recommendations

CPU: Multi-core processor (4+ cores recommended)
RAM: 4GB+ for full benchmark
Storage: 1GB+ free space for results and figures

Troubleshooting

Common Issues

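Memory errors: Reduce the batch size or run fewer algorithms per configuration
Slow execution: Use fewer datasets or algorithm pairs, as in the reduced configuration under Performance Optimization
Import errors: Ensure all dependencies are installed (pip install -r requirements.txt)
File not found: Check that the configuration file passed to --config exists (for example, configs/full_benchmark.yaml)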

Performance Optimization

# Write a reduced configuration: two datasets and a single logistic regression pair
cat > custom_config.yaml << 'EOF'
datasets:
  - wine
  - digits

algo_pairs:
  logreg:
    - A: {C: 1.0, penalty: l2, solver: lbfgs}
      B: {C: 0.1, penalty: l2, solver: lbfgs}
      pair_name: pair1
EOF

# Run the benchmark with the reduced configuration
delta-audit run --config custom_config.yaml

Validation

To validate your results:

Check file sizes: Results files should be ~50-100KB
Verify metrics: BAC should be in [-1, 1], DCE should be positive
Compare with paper: Key statistics should match published results
Run quickstart: Should complete in <5 minutes
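The file-size and metric-range checks above are easy to script. The following is a minimal sketch using only the standard library; the column names `bac` and `dce` are assumptions (check the header of your own results/delta_summary.csv), and the helper names are hypothetical rather than part of Delta-Audit.

```python
import csv
import os

def check_rows(rows, bac_col="bac", dce_col="dce"):
    """Return a list of human-readable problems found in the result rows.

    Column names are assumptions; adjust them to match the CSV header.
    """
    problems = []
    for i, row in enumerate(rows):
        bac = float(row[bac_col])
        dce = float(row[dce_col])
        if not -1.0 <= bac <= 1.0:  # also flags NaN
            problems.append(f"row {i}: BAC {bac} outside [-1, 1]")
        if not dce >= 0.0:  # flags negative values and NaN; zero is allowed here
            problems.append(f"row {i}: DCE {dce} is negative or NaN")
    return problems

def check_file(path="results/delta_summary.csv"):
    """Return (size in KB, list of problems) for a results file."""
    size_kb = os.path.getsize(path) / 1024
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    return size_kb, check_rows(rows)

# Example with in-memory rows (values are made up).
print(check_rows([{"bac": "0.4", "dce": "0.1"}, {"bac": "1.5", "dce": "-0.2"}]))
```

Calling `check_file()` after a full run reports the file size alongside any out-of-range rows, which covers the first two items in the list above.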

Extending Benchmarks

Adding New Algorithms

Add algorithm configuration to configs/full_benchmark.yaml
Implement algorithm in src/delta_audit/runners.py
Test with quickstart first

Adding New Datasets

Add dataset loading function to src/delta_audit/runners.py
Add dataset to configuration file
Ensure proper preprocessing (scaling, etc.)

Custom Metrics

Implement metric in src/delta_audit/metrics.py
Add to compute_all_metrics function
Update documentation and tests
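As an illustration of the first two steps, the sketch below defines a new metric and adds it to a dictionary of results. The real signature of compute_all_metrics is not shown on this page, so `compute_all_metrics_sketch` is a stand-in with an assumed signature (two attribution vectors in, a dict of named values out), and the cosine-similarity metric and key names are hypothetical examples.

```python
import math

def attribution_cosine(attr_a, attr_b):
    """Hypothetical custom metric: cosine similarity of two attribution vectors."""
    dot = sum(a * b for a, b in zip(attr_a, attr_b))
    norm_a = math.sqrt(sum(a * a for a in attr_a))
    norm_b = math.sqrt(sum(b * b for b in attr_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)

def compute_all_metrics_sketch(attr_a, attr_b):
    """Stand-in for compute_all_metrics; the real signature may differ."""
    return {
        # An existing-style metric, included only to show the dict shape.
        "delta_mag_l1": sum(abs(b - a) for a, b in zip(attr_a, attr_b)),
        # The newly added metric is registered under its own key.
        "attr_cosine": attribution_cosine(attr_a, attr_b),
    }

print(compute_all_metrics_sketch([1.0, 0.0], [0.0, 1.0]))
# {'delta_mag_l1': 2.0, 'attr_cosine': 0.0}
```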
