Skip to content

Latest commit

 

History

History
279 lines (219 loc) · 9.06 KB

File metadata and controls

279 lines (219 loc) · 9.06 KB

Deterministic Runs

This document explains how to run the training pipeline with bitwise reproducible results for financial audit compliance and research reproducibility.

Quick Start

Run with Strict Determinism

bin/run_deterministic.sh python TRAINING/orchestration/intelligent_trainer.py --experiment-config determinism_test

Run Normal (Best Effort)

python TRAINING/orchestration/intelligent_trainer.py --experiment-config your_config

Modes

Mode Command Reproducibility Performance
Strict bin/run_deterministic.sh python ... Bitwise identical CPU only (GPU disabled)
Best Effort python ... Seeded but may vary Full GPU/parallelism

What Causes Non-Determinism?

GPU is the primary source of non-determinism. GPU operations use parallel floating-point arithmetic where operation ordering is not guaranteed, causing slight numerical differences between runs.

Multithreading with CPU is generally deterministic when models are properly seeded. The pipeline sets seeds for all randomness sources, so multithreaded CPU execution typically produces identical results.

What Strict Mode Does

The launcher script (bin/run_deterministic.sh) sets critical environment variables before Python starts:

PYTHONHASHSEED=42        # Deterministic Python hash
REPRO_MODE=strict        # Enable strict mode
OMP_NUM_THREADS=1        # Single-threaded OpenMP (conservative)
MKL_NUM_THREADS=1        # Single-threaded MKL (conservative)
CUBLAS_WORKSPACE_CONFIG=:4096:8  # CUDA determinism (if GPU used)

The training pipeline then:

  • Forces tree models to use device_type=cpu (no GPU) - this is the critical setting
  • Sets deterministic=True for LightGBM
  • Uses n_jobs=1 (conservative, but multithreading may work)
  • Uses SHA256-based seed derivation for stability
  • Injects seeds into all model configs automatically

Verification

Run Twice and Compare

# Run 1
bin/run_deterministic.sh python TRAINING/orchestration/intelligent_trainer.py \
    --experiment-config determinism_test 2>&1 | tee run1.log

# Run 2
bin/run_deterministic.sh python TRAINING/orchestration/intelligent_trainer.py \
    --experiment-config determinism_test 2>&1 | tee run2.log

# Compare fingerprints (should be identical)
diff <(grep fingerprint run1.log) <(grep fingerprint run2.log)

Verify Using Snapshots

Each run produces snapshot.json files with determinism-relevant signatures:

# Find snapshots from your run
find RESULTS/runs/*/targets/*/reproducibility -name "snapshot.json"

# Compare comparison_group fields between runs
jq '.comparison_group' run1/targets/fwd_ret_10m/reproducibility/CROSS_SECTIONAL/cohort=*/snapshot.json

Key fields to compare:

Field Purpose
dataset_signature Hash of data_dir + symbols
task_signature Hash of target config
routing_signature Hash of view + symbol routing
hyperparameters_signature Hash of model hyperparameters
train_seed Random seed used
metrics_sha256 Hash of output metrics

If all signatures match between runs, the configuration is identical.

What to Expect

  • Identical fingerprints = determinism working
  • Different fingerprints = something is non-deterministic

Config Requirements

For guaranteed deterministic runs, your experiment config should disable GPU:

# The critical setting - GPU is the source of non-determinism
reproducibility:
  mode: strict
  strict:
    disable_gpu_tree_models: true  # REQUIRED for determinism

Optional (conservative) settings:

# These are set by the launcher script but may not be strictly necessary
# if all models are properly seeded
multi_target:
  parallel_targets: false  # Sequential (conservative)

threading:
  parallel:
    enabled: false  # Disable parallelism (conservative)

Note: Multithreading with properly seeded CPU models typically produces identical results. The single-threaded settings are conservative guarantees, not strict requirements.

See CONFIG/experiments/determinism_test.yaml for a complete example.

Architecture

bin/run_deterministic.sh
    │
    ├── Sets PYTHONHASHSEED=42 (before Python starts)
    ├── Sets REPRO_MODE=strict
    ├── Sets OMP_NUM_THREADS=1, MKL_NUM_THREADS=1
    │
    ▼
intelligent_trainer.py
    │
    ├── import repro_bootstrap (FIRST - sets thread env vars)
    ├── Validates PYTHONHASHSEED is set
    ├── Checks no numeric libs imported before bootstrap
    │
    ▼
determinism.py
    │
    ├── load_reproducibility_config() - ENV overrides YAML
    ├── seed_all() - Sets Python/NumPy/Torch seeds
    ├── create_estimator() - Single choke point for model creation
    │   ├── Applies n_jobs=1 in strict mode
    │   ├── Applies device_type=cpu in strict mode
    │   └── Uses normalize_seed() to prevent edge cases
    │
    ▼
Bitwise Identical Results

Output Directory Structure

Reproducibility artifacts are organized under each target:

RESULTS/runs/<run_name>/
└── targets/<target>/
    └── reproducibility/
        ├── CROSS_SECTIONAL/
        │   ├── universe=<sig>/           # Identifies symbol set
        │   │   └── feature_importances/  # Per-model CSV files
        │   └── cohort=cs_<id>/
        │       ├── snapshot.json         # Full snapshot with signatures
        │       ├── metrics.json          # Output metrics
        │       ├── metadata.json         # Run metadata
        │       └── diff_prev.json        # Comparison to previous run
        └── SYMBOL_SPECIFIC/
            └── symbol=AAPL/              # Per-symbol directory
                ├── feature_importances/  # Per-model CSV files
                └── cohort=sy_<id>/
                    ├── snapshot.json
                    ├── metrics.json
                    ├── metadata.json
                    └── diff_prev.json

Key directories:

  • CROSS_SECTIONAL/ - Multi-symbol (panel) runs
  • SYMBOL_SPECIFIC/symbol=XXX/ - Per-symbol runs

Configuration Files

File Purpose
CONFIG/pipeline/training/reproducibility.yaml Default determinism settings
CONFIG/experiments/determinism_test.yaml Test config with parallelism disabled
bin/run_deterministic.sh Launcher script

reproducibility.yaml

reproducibility:
  mode: strict  # or best_effort
  seed: 42
  version: v1
  
  strict:
    require_env_vars: true
    disable_gpu_tree_models: true   # CRITICAL: GPU causes non-determinism
    force_single_thread: true        # Conservative (may not be required)
    enforce_stable_ordering: true    # Sort features/targets for stability

Key insight: disable_gpu_tree_models: true is the critical setting. GPU floating-point operations have non-deterministic ordering. CPU operations with proper seeding are deterministic even with multithreading.

For Production Financial Use

Audit Trail

  1. Always use strict mode for compliance-critical runs
  2. Store fingerprints in your audit database
  3. Compare fingerprints to detect code/environment drift

Environment Pinning

Strict determinism only guarantees reproducibility within the same environment:

  • Same Python version
  • Same library versions (lightgbm, xgboost, numpy, etc.)
  • Same CPU architecture

Pin versions in requirements.txt or use Docker for full reproducibility.

Regulatory Compliance

  • Run determinism test before production deployments
  • Store prediction hashes alongside model artifacts
  • Document any fingerprint changes in change log

Troubleshooting

"PYTHONHASHSEED not set"

Use the launcher script:

bin/run_deterministic.sh python your_script.py

"Bootstrap imported too late"

Ensure repro_bootstrap is imported FIRST in your entrypoint:

import TRAINING.common.repro_bootstrap  # MUST be first
import numpy as np  # Now safe

Different fingerprints across runs

Check:

  1. Parallelism is disabled in config
  2. Using the launcher script
  3. Same library versions
  4. No external randomness (e.g., shuffled data loading)

API Reference

Key Functions

from TRAINING.common.determinism import (
    create_estimator,      # Single choke point for model creation
    seed_all,              # Set all random seeds
    resolve_seed,          # SHA256-based seed derivation
    is_strict_mode,        # Check if strict mode enabled
    stable_sort,           # Deterministic ordering
    load_reproducibility_config,  # Load SST config
)

Example Usage

import TRAINING.common.repro_bootstrap  # FIRST!

from TRAINING.common.determinism import create_estimator, seed_all, resolve_seed

seed_all(42)

# Create model with determinism params automatically applied
model = create_estimator(
    library="lightgbm",
    base_config={"n_estimators": 100},
    seed=resolve_seed(42, "training", target="fwd_ret_10m"),
    problem_kind="regression"
)