v1.2.6 #208
Merged
guillaume-byte (Member) commented on Jun 18, 2026
- Fix ledgered parameters issues with ValueProxy
- Fix the bounding boxes normalization issue in the prediction column
…ration
pause_controller starts paused by default and is intended to be driven by
the UI / training loop, not the data path. Calling _wait_if_paused() from
__iter__ and __next__ meant any script that iterated the loader before an
external resume would hang forever — even at num_workers=0. Workers made
the failure mode look like a worker bug (leaked semaphores, freezes under
load), but the same hang reproduced with no workers at all.
Pause is a training-loop concern; the loader should just deliver bytes.
_wait_if_paused() itself is preserved so training loops can still call it
explicitly at safe points (between optimizer steps).
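A minimal sketch of that split, assuming a `threading.Event`-based controller. `PauseController`, `wait_if_paused`, `batches`, and `train` are hypothetical stand-ins for illustration, not the WeightsLab classes. The data path never consults the pause state; the loop checks it only between optimizer steps.

```python
import threading


class PauseController:
    """Hypothetical stand-in: starts paused, resumed by an external driver."""

    def __init__(self):
        self._running = threading.Event()  # cleared means paused

    def pause(self):
        self._running.clear()

    def resume(self):
        self._running.set()

    def wait_if_paused(self, timeout=None):
        # Blocks while paused; returns True once running, False on timeout.
        return self._running.wait(timeout)


def batches(n):
    # The data path only delivers data; it never looks at the pause state.
    for i in range(n):
        yield i


def train(controller, n_steps):
    seen = []
    for batch in batches(n_steps):
        # Safe point: between optimizer steps, not inside the loader.
        controller.wait_if_paused()
        seen.append(batch)  # placeholder for forward / backward / step
    return seen


controller = PauseController()
prefetched = list(batches(3))  # iterating while still paused does not hang
controller.resume()
trained = train(controller, 3)
```

With this shape, a script that only iterates the loader completes even if nothing ever calls `resume()`.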
Verified:
- weightslab/tests/backend/test_data_loader_interface.py: 9/9 pass
(incl. test_dataloader_interface_uses_multiple_workers,
test_multiple_workers_parallelize_preprocessing)
- Wider tests/backend + tests/components sweep: 108 pass, 2 unrelated
pre-existing failures in test_ui_docker_bridge (cert script + Windows
path test on Linux)
- ws-detection example with num_workers={0,2,4} on CPU: clean runs,
W=2 ~21% faster than W=0, no hangs or crashes
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a4cf489 removed DataLoaderInterface._wait_if_paused's call sites to fix the num_workers>0 startup deadlock, but that was also the only place enforcing the explicit pause_at_step hyperparam. Re-add the trigger in GuardContext.__enter__ (training only), before the architecture lock so the pause blocks lock-free. pause() zeroes pause_at_step, so it fires once.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Its call sites were removed in a4cf489 (deadlock fix) and its pause_at_step trigger was relocated to GuardContext.__enter__ in the previous commit, so the method is now unused.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- main.py: refactor train loop to the infinite-generator form (re-shuffles each epoch) and default both loaders to num_workers=2 (GPU sweep: ~+76% throughput vs workers=0, sweet spot; W=4 regresses).
- bench.py + run_bench.sh: configurable num_workers/epochs/wall-time harness with WL_BENCH_NO_VAL for clean throughput runs; forces the model onto the target device after watch_or_edit (see device note below).
- config.yaml: local run tweaks.
Note: watch_or_edit(flag="model", device=...) currently drops its device= kwarg (src.py model branch returns the proxy without honoring it), so the bench applies an explicit .to(device) workaround; the framework path is still unfixed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…terface
Both were only referenced by _wait_if_paused, removed in 41b76b8.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
These were local correctness-check tooling for the num_workers sweep, not part of the example. Kept on disk locally (untracked), removed from version control.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
main.py (runnable train-loop fix) moves to a dedicated fix branch off dev; config.yaml carried machine-local tweaks. Both are kept locally but removed from this branch so it contains only the parallelism/pause framework changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
An earlier merge left main.py's train loop in a non-runnable state. Restore a working loop (infinite-generator batching that re-shuffles each epoch, per-sample loss/IoU via the criterion dict) and default the loaders to num_workers=2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fix ws-detection example: restore runnable training loop
…llel+distributed+plightning+networkfs Framework intergrations/parallel+distributed+plightning+networkfs
* fix oom bug on break by slice
The break-by-slice handler called get_signal_history_per_sample(), which inflated the entire per-sample signal history into nested dicts and then triple-looped over it (~609 MB spike per slice query -> OOM).
Separately, query_per_sample() compared sample ids as int (stored) vs str (queried), so the cheap path would have silently returned 0 rows.
Query the compact per-signal arrays directly via query_per_sample, normalize the id compare to str, and derive audit_mode from the eval-marker hash.
A 200-sample slice over 2.1M entries now peaks at +0.6 MB instead of +609 MB (~1000x lighter).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Fix BBS feature to make the main computation part on our side
---------
Co-authored-by: Alexandru Rotaru <rotarualexandruandrei94@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: GuillaumePELLUET <guillaume@graybx.com>
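The int-vs-str mismatch described in the break-by-slice fix can be reproduced in a few lines. This is a sketch only: `query_per_sample_sketch` and the parallel-array layout are assumptions for illustration, not the actual query_per_sample implementation.

```python
def query_per_sample_sketch(stored_ids, values, wanted_ids):
    """Return (sample_id, value) pairs for the requested ids.

    stored_ids and values are parallel compact arrays (hypothetical layout).
    Both sides are compared as str, so int-vs-str inputs still match.
    """
    wanted = {str(sid) for sid in wanted_ids}
    return [
        (str(sid), val)
        for sid, val in zip(stored_ids, values)
        if str(sid) in wanted
    ]


stored_ids = [0, 1, 2, 1]      # ids stored as int
values = [0.9, 0.4, 0.7, 0.3]

# Naive compare: int ids never equal the str ids coming from the query.
naive = [v for sid, v in zip(stored_ids, values) if sid in {"1"}]

# Normalized compare: both sides are str.
fixed = query_per_sample_sketch(stored_ids, values, ["1"])
```

The naive filter returns nothing without raising, which is why the cheap path would have failed silently.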
…d csv (#183)
* Implement comprehensive audit logging for all gRPC user interactions with before/after tracking
- Create AuditLogger class in backend/audit_logger.py with thread-safe JSON/CSV writing
- Initialize audit loggers in ExperimentService and DataService with root_log_dir
- Log all 8+ gRPC handlers with detailed before/after values:
  * ExperimentCommand: hp_change, mode_switch, pause, resume
  * GetLatestLoggerData: metrics_fetch
  * RestoreCheckpoint: checkpoint_restore
  * TriggerEvaluation: evaluation_start
  * EditDataSample: tag_add, tag_remove, sample_discard, sample_restore
  * ApplyDataQuery: query_execute
  * GetDataSamples: data_fetch
- Append-only audit_log.json and audit_log.csv files in root_log_dir
- ISO 8601 timestamps with microseconds in JSON; CSV with escaped JSON details
- Thread-safe file operations
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
* Add comprehensive unit tests and documentation for audit logging
Tests:
- 26 unit tests covering all AuditLogger functionality
- Tests for JSON and CSV output formats
- Thread-safe concurrent logging with 10+ threads
- Error handling and edge cases
- Real-world scenario tests (hyperparameter changes, data edits, training control)
- Special characters and Unicode handling
- All tests passing
Documentation:
- Comprehensive audit_logging.rst guide with examples
- Overview of what's logged (7+ action types)
- JSON and CSV format specifications
- Configuration and file locations
- Real-world scenarios and troubleshooting
- API reference and best practices
- Added to docs index for discoverability
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
* Reorganize documentation: separate gRPC functions and audit logging
Documentation restructuring:
- Created new gRPC section with subsections
- docs/grpc/index.rst: Overview and architecture of gRPC communication
- docs/grpc/grpc_functions.rst: Complete reference of all RPC handlers (13 methods)
  * ExperimentCommand (HP changes, pause/resume, mode switching)
  * GetLatestLoggerData (metrics and signals)
  * RestoreCheckpoint, TriggerEvaluation, GetEvaluationStatus, CancelEvaluation
  * GetDataSamples, ApplyDataQuery, EditDataSample, GetDataSplits
  * GetWeights, GetActivations, GetSamples
  * Includes: request/response types, parameters, behavior, audit logging status
  * Covers: error handling, performance considerations, debugging, common patterns
- docs/grpc/audit_logger.rst: Comprehensive audit logging documentation
  * Moved from docs/audit_logging.rst with updated cross-references
  * Explains what gets logged, file formats (JSON/CSV), configuration
  * Real-world scenarios, troubleshooting, API reference, best practices
- Updated docs/index.rst to reference new gRPC section
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
* Add configurable audit log output format via AUDIT_LOG_FORMAT environment variable
- Modified AuditLogger to write only one format (json OR csv), not both
- Added format parameter to AuditLogger.__init__() with environment variable support
- AUDIT_LOG_FORMAT=json (default) or AUDIT_LOG_FORMAT=csv
- Explicit format parameter takes precedence over environment variable
- Updated all 33 tests to work with format selection:
  - Fixed TestAuditLoggerCSV, TestAuditLoggerErrorHandling, TestAuditLoggerThreadSafety
  - Added TestAuditLoggerFormat class with 4 new tests for format configuration
- Updated docs/grpc/audit_logger.rst Configuration section with AUDIT_LOG_FORMAT details and precedence rules
- All tests passing: 33/33
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
* Add ability to disable audit logging with AUDIT_LOG_FORMAT=none
- Added "none" format option to AuditLogger to disable audit logging entirely
- When format="none", log_event() returns early without creating files
- Added AUDIT_LOG_FORMAT=none to docs/configuration.rst environment variables
- Updated docs/grpc/audit_logger.rst Configuration section with disable feature
- Added 3 new tests for disable functionality (36 total tests, all passing):
  - test_none_format_disables_logging()
  - test_none_format_from_environment_variable()
  - test_explicit_format_none_overrides_json_default()
- Precedence unchanged: explicit format > environment variable > default
Use cases for disabling:
- Reduce disk I/O overhead in high-performance scenarios
- Disable audit history for development/debugging sessions
- Focus on other logging without audit pollution
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
* Fix FutureWarning: Set incompatible dtype column to object before assignment
When upserting data with mixed dtypes (e.g., initializing column with bool False, then assigning string/array values), pandas raises a FutureWarning about incompatible dtypes.
Fix by casting both the existing column and incoming values to object dtype before assignment to prevent dtype conflicts during merge operations.
This resolves:
FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas.
Tests: test_h5_dataframe_store.py passes, FutureWarning no longer raised
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
* Fix AttributeError: Use correct EDIT_ACCUMULATE instead of non-existent EDIT_ADD enum
The SampleEditType enum only defines:
- EDIT_OVERRIDE: Replace all tags
- EDIT_ACCUMULATE: Add/accumulate tags
- EDIT_REMOVE: Remove tags
The audit logging code was trying to use the non-existent EDIT_ADD enum value. Fixed by using EDIT_ACCUMULATE for tag_add operations, which is the correct enum value for adding/accumulating tags based on _calculate_tag_column_updates docstring.
Error was:
AttributeError: Enum SampleEditType has no value defined for name 'EDIT_ADD'
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
* fix subscribe function to allow user to compute history-based samples
* remove caching
* fix warning issue with h5
* add sanity check on modeling feature and fix datasampler issues
* clean custom signals decorator feature in readme and doc
* Refactor query_per_sample to return dict of samples instead of list of tuples
Changed return format from:
List of (sample_id, step, value) tuples
To:
Dict mapping sample_id → list of dicts with 'model_age' and 'signal_value' keys
Example:
{'0': [{'model_age': x, 'signal_value': y}, ...], '1': [...]}
Benefits:
- More structured and readable format
- Keys are labeled, not positional
- Easier to work with in custom signals (e.g., computing loss variance)
- Matches the format expected in SignalContext.subscribed_history
Both get_current_signaL_history_per_sample and query_per_sample now return the new format.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
* Fix registered subscribed signals
* Remove data fetch operations from audit logging
Audit logging should only track user actions (write operations), not read-only operations like:
- GetLatestLoggerData (metrics_fetch)
- GetDataSamples (data_fetch)
These are passive retrieval operations, not modifications to experiment state.
Changes:
- Removed data_fetch and metrics_fetch from audit logging documentation
- Updated audit_logger.rst to list only user action types
- Changed GetDataSamples and GetLatestLoggerData to 'Audit Logged: No (read-only operation)'
- Updated reproducing experiment scenario to focus on user actions only
Audit logging now logs only:
- Model Control: hp_change, pause, resume, mode_switch
- Data Operations: tag_add, tag_remove, sample_discard, sample_restore, query_execute
- Checkpoint & Evaluation: checkpoint_restore, evaluation_start
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
* Fix audit logger implementation: reverse chronological ordering and test fixes
- Fix _flush_to_json to reverse event order within batches for strict reverse chronological ordering (newest first)
- Add buffer_size=1 to all test instances to ensure events flush immediately during testing
- Update test expectations for reverse chronological order (newest events appear first in JSON)
- Fix timestamp assertions in training control scenario to expect decreasing order (ts1 > ts2 > ts3)
- All 41 audit logger tests now pass
Features verified:
1. Persistence: audit logs append when restarting experiments from existing root_log_dir
2. Reverse chronological: newest events appear first in JSON output
3. Buffering: events batch in memory before writing to disk, with configurable buffer_size
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
* Add audit logging for plot note operations
- Log note_write action when users save or clear notes on plot points
- Capture metric_name, model_age, note_text, and note_action (saved/cleared)
- Update audit logger documentation to include note_write in actions list
This allows compliance tracking of all user annotations and notes.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
* Remove buffering approach from audit logger - use immediate writes instead
- Remove buffer_size parameter and _event_buffer from AuditLogger
- Change to immediate writes on each log_event() call
- Rename _flush_to_json/_flush_to_csv to _write_json/_write_csv for single event writes
- Remove flush() and _flush_buffer() methods
- Remove all buffering-related tests
- Update documentation to reflect immediate write approach
Benefits:
- No data loss on process crash or sudden termination
- All audit events are persisted immediately to disk
- Simpler implementation with same persistence guarantees
- Still maintains reverse chronological ordering (newest first)
Tests: 38 passed (3 buffering tests removed)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
* Expand signal decorator and SignalContext documentation
- Add comprehensive parameter reference table for @wl.signal decorator
  * name, subscribe_to, compute_every_n_steps, include_history, include_history_metadata
  * Include performance considerations and use cases
- Add advanced example from weightslab_kitchen: loss coefficient of variation
  * Shows how to access subscribed_history for multi-step analysis
  * Demonstrates history entry structure (signal_value, model_age)
  * Real-world use case: detect training instability
- Expand SignalContext documentation with detailed attribute reference
  * Separate sections for dynamic signals vs. static signals
  * Document subscribed_history structure and access patterns
  * Add convenience properties (image, points, is_static, is_dynamic)
  * Include usage patterns and code examples
This makes it clearer how to write effective custom signals with full history access.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
* Add sanity check with iterator
* Fix utests bug with AuditLogs
* update hard coded signals desc
* set needs btw utests and pip publish packages and test
* fix code quality issues
* remove data fetching from audit and slow useless part of checkpoint manager loading (rng and data iterator state) as we do not manage data state reproducibility for now
---------
Co-authored-by: Claude Haiku 4.5 <noreply@anthropic.com>
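The AUDIT_LOG_FORMAT precedence stated in this squashed commit (explicit format > environment variable > default, with "none" disabling logging) could look roughly like the sketch below. `resolve_audit_format` and `should_log` are hypothetical helpers, and rejecting unknown values with a ValueError is an assumption; the commit messages do not say how AuditLogger treats invalid values.

```python
import os

VALID_FORMATS = ("json", "csv", "none")
DEFAULT_FORMAT = "json"


def resolve_audit_format(explicit=None, environ=None):
    """Pick the audit format: explicit argument, then env var, then default."""
    environ = os.environ if environ is None else environ
    chosen = explicit or environ.get("AUDIT_LOG_FORMAT") or DEFAULT_FORMAT
    if chosen not in VALID_FORMATS:
        # Assumption: the real logger may fall back or warn instead of raising.
        raise ValueError(f"unsupported audit format: {chosen!r}")
    return chosen


def should_log(fmt):
    # "none" disables audit logging, so no file is ever created.
    return fmt != "none"


default_fmt = resolve_audit_format(environ={})
env_fmt = resolve_audit_format(environ={"AUDIT_LOG_FORMAT": "csv"})
explicit_fmt = resolve_audit_format("none", environ={"AUDIT_LOG_FORMAT": "json"})
```

Passing `environ` as a plain dict keeps the precedence rule testable without touching the real process environment.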
* Upgrade database to handle multi-indexing samples_id // instance_id, and subsequent functions
* Fix certificate generation prompts in Windows tests
Remove certificate generation and validation prompts that appear when running
test_ui_docker_bridge tests on Windows. The test_complete_onboarding_workflow
was calling actual certificate generation code without mocking it, which would
trigger Windows certificate store installation prompts.
Changes:
- Added proper mocking of _generate_certs_with_fallback() function
- Added mocking of _run_shell_script() to prevent bootstrap script execution
- Properly configured CertAuthManager mocks with check_and_apply() return value
- Added from_env_or_default() mock configuration for nested calls
All 40 tests in test_ui_docker_bridge.py now pass without prompting for
certificate validation on Windows.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
* Update ws-detection to use per-instance metrics and losses
Switch detection example from per-sample to per-instance metrics:
- Replace PerSampleDetectionLoss with PerInstanceDetectionLoss
- Replace PerSampleIoU with PerInstanceIoU
- Log hierarchical loss levels: instance, sample, and batch
- Enable per-instance loss tracking for multi-instance dataframe support
Changes:
- Import PerInstanceDetectionLoss and PerInstanceIoU from criterions
- Configure losses with return_levels=True to get instance/sample/batch breakdown
- Manually log per-instance and per-sample metrics via wl.log_sample_signals()
- Use 'batch' level loss for backward pass to ensure proper gradient flow
- Maintain per-instance IoU computation for bounding box evaluation
This enables comprehensive per-annotation analysis in the UI while maintaining
per-sample aggregation for backward pass compatibility.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
* Add both per-sample AND per-instance metrics to ws-detection
Track metrics at both granularity levels:
Per-Sample Metrics (aggregated):
- PerSampleDetectionLoss: Bounding box, classification, DFL losses averaged per sample
- PerSampleIoU: IoU averaged per sample
- Auto-logged via per_sample=True
- Signal names: train/bbxs_sample, train/clsf_sample, train/dfl_sample, iou/train_sample
Per-Instance Metrics (per annotation):
- PerInstanceDetectionLoss: Individual bbox losses for each annotation
- PerInstanceIoU: IoU for each bounding box
- Manually logged via wl.log_sample_signals()
- Signal names: train/bbxs_instance, train/clsf_instance, train/dfl_instance, iou_instance
This enables:
- Aggregate per-sample analysis for model evaluation
- Fine-grained per-annotation debugging
- Identification of problematic detections at the instance level
- Proper gradient flow through batch-level loss for training
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
* Auto-save per-instance signals via per_instance flag on watch_or_edit
Add framework-side support for per-instance signal logging, mirroring the
existing per_sample flow. Users can now wrap a per-instance loss/metric with
`per_instance=True` and WeightsLab will:
1. Extract instance values from dict outputs (`{'instance','sample','batch'}`
from PerInstanceDetectionLoss) or flat tensors (PerInstanceIoU).
2. Look up `batch_idx` from the second positional argument (the standard
detection `batch` dict) or from kwargs, mapping each instance to its
sample position.
3. Assign annotation_ids 0,1,2,... within each sample.
4. Save per-instance values to the dataframe at `(sample_id, annotation_id)`
via the new `DATAFRAME_M.enqueue_instance_batch`.
5. Still log the per-sample aggregated mean for the dashboard and return the
original dict to the caller so `out['batch']` works for backward.
Changes:
- `dataframe_manager.enqueue_instance_batch`: writes per-annotation rows
using `update_values` (handles multi-index natively).
- `src.save_instance_signals`: helper that maps instance values to
(sample_id, annotation_id) via batch_idx and routes to the dataframe.
- `wrappered_fwd`: detects `per_instance=True`, unwraps dict outputs,
invokes `save_instance_signals`, and returns the original dict.
- `ws-detection/main.py`: replaces manual `wl.log_sample_signals` calls
with `per_instance=True` on the watch_or_edit registrations.
- New unit test `test_enqueue_instance_batch_writes_per_annotation`
validates the end-to-end write path.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
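Steps 2 to 4 of the per_instance flow above amount to a small mapping from `batch_idx` to `(sample_id, annotation_id)` keys. A minimal sketch, assuming `batch_idx` is a flat sequence of in-batch sample positions and `sample_ids` lists the batch's sample ids in order; `map_instances` is a hypothetical name, not the `src.save_instance_signals` helper itself.

```python
def map_instances(batch_idx, sample_ids):
    """Map each instance to a (sample_id, annotation_id) key.

    batch_idx[i] is the in-batch position of the sample that owns instance i.
    annotation ids count 0, 1, 2, ... within each sample, in order of appearance.
    """
    next_annotation = {}
    keys = []
    for pos in batch_idx:
        pos = int(pos)
        annotation_id = next_annotation.get(pos, 0)
        next_annotation[pos] = annotation_id + 1
        keys.append((sample_ids[pos], annotation_id))
    return keys


# Six boxes spread over three samples of one batch.
keys = map_instances([0, 0, 1, 2, 2, 2], ["s10", "s11", "s12"])
```

Each key then addresses one dataframe row, so an instance-length loss tensor can be zipped with `keys` before it is enqueued.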
* Simplify PerInstanceDetectionLoss to return flat instance tensor
Drop the dict-with-levels return type from PerInstanceDetectionLoss — only
the per-instance values are needed, since PerSampleDetectionLoss already
provides the per-sample gradient path for backward.
PerInstanceDetectionLoss now returns a flat `(num_instances,)` tensor,
ordered as in `batch['batch_idx']`. With `per_instance=True` on
watch_or_edit, the framework auto-saves these values at
`(sample_id, annotation_id)` in the dataframe.
Changes:
- `criterions.py`: remove `return_levels` param; forward returns a flat
instance tensor and only that.
- `main.py`: backward now uses `per_sample.mean()` from
PerSampleDetectionLoss; per-instance criterions are called only for their
side-effect of auto-saving annotation-level signals.
- `src.py`: skip the per-sample save_signals path when `per_instance=True`
(instance-length tensors don't map 1:1 to batch_ids).
- New test `test_save_instance_signals_maps_batch_idx_to_annotation_ids`
verifies the (sample_id, annotation_id) mapping from batch_idx.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
* Fix duplicate-label error when per-sample buffer flushes into multi-index df
enqueue_batch produces single-level (sample_id) records, but when the global
dataframe has a MultiIndex, concatenating them directly creates a hybrid index
with both tuples and ints. The next flush then crashes with:
ValueError: cannot reindex on an axis with duplicate labels
at _apply_buffer_records_nonblocking.
Root cause: _apply_buffer_records and _apply_buffer_records_nonblocking didn't
bridge between the single-level buffer and the multi-index global dataframe.
Fix: add _broadcast_to_multi_index which expands each single-level (sample_id)
buffer record into one row per existing (sample_id, annotation_id) pair. Both
apply paths now invoke it before merging, so the global dataframe stays a
proper MultiIndex and per-sample signals are broadcast to every annotation of
the sample.
Adds regression test test_per_sample_buffer_into_multi_index_does_not_corrupt
that asserts index integrity through two consecutive flushes and that
per-sample values are correctly broadcast to all annotations.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
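The broadcast step can be illustrated with plain dicts standing in for the pandas structures. This is a sketch under stated assumptions: `broadcast_to_multi_index` here is a simplified stand-in for `_broadcast_to_multi_index`, and skipping samples that have no existing annotation rows is an assumption, since the commit message does not describe that case.

```python
def broadcast_to_multi_index(records, existing_keys):
    """Expand per-sample records into one row per (sample_id, annotation_id).

    records:       {sample_id: {column: value}} from the single-level buffer
    existing_keys: iterable of (sample_id, annotation_id) already in the
                   global dataframe
    """
    annotations = {}
    for sample_id, annotation_id in existing_keys:
        annotations.setdefault(sample_id, []).append(annotation_id)

    expanded = {}
    for sample_id, row in records.items():
        # Assumption: samples without existing annotation rows are skipped.
        for annotation_id in annotations.get(sample_id, []):
            expanded[(sample_id, annotation_id)] = dict(row)
    return expanded


existing = [("a", 0), ("a", 1), ("b", 0)]
buffer_records = {"a": {"loss": 0.5}}
expanded = broadcast_to_multi_index(buffer_records, existing)
```

Every key in the result is a tuple, so merging it into the global frame cannot produce the hybrid tuple/int index that triggered the reindex error.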
* Fix _normalize_arrays_for_storage on multi-index rows
When the dataframe is multi-indexed, `row.name` is a
`(sample_id, annotation_id)` tuple. The code passed that tuple directly to
`dataset.get_index_from_sample_id` (which expects a plain sample_id string),
causing every array-column normalization to fail with a KeyError. The error
was caught and only logged at DEBUG level, but it flooded the log and
disabled the target/prediction normalization on every flush.
Extract `sample_id = row.name[0]` when `row.name` is a tuple; otherwise
fall back to the original row.name.
Adds regression test test_normalize_arrays_for_storage_handles_multi_index_row
that injects a fake dataset and asserts the sample_id (not the tuple) is
passed to `get_index_from_sample_id`.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
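The extraction described above is a one-line guard. A sketch, with `sample_id_from_row_name` as a hypothetical helper name:

```python
def sample_id_from_row_name(row_name):
    """Return the plain sample_id for single-level and multi-index rows."""
    if isinstance(row_name, tuple):
        # Multi-index row: (sample_id, annotation_id)
        return row_name[0]
    return row_name


multi = sample_id_from_row_name(("sample_7", 2))
single = sample_id_from_row_name("sample_7")
```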
* cleaning branch
* fix usecases examples configs and python file
* fix grpc and agent interface with multi-indexing
* fix multi-indexing issues
* fix ext files
* Add cat. tag management and fix tests
* Fix documentation
* Fix h5 compat. with multi-indexing
* Fix utests and add new ones
* Fix multi-index issues with trainer gRPC functions; h5 array issues for sync batch idx; and detection tasktype bb; and finally update the documentation
* Fix broken utests
---------
Co-authored-by: Claude Haiku 4.5 <noreply@anthropic.com>
* DDP support (WIP): SPMD primitives + 4-plane model + workers-correctness fixes
Wraps the existing single-process YOLO ws-detection example for DDP via
mp.spawn — no train.py edits needed past the spawn shim. Surfaces:
- weightslab/components/ddp_basic_building_blocks.py — SPMD primitives:
register_consistent_state, reconcile_all (bundled DOWN broadcast),
register_outbox + flush_outbox (bundled UP gather), sync_step (per-step
anchor + collective pause-spin). One broadcast + one gather per step.
- weightslab/components/ddp_planes.py — the 4-plane model (CONFIG /
CONTROL / DATAFRAME / LOGGER) + 5 dtype-keyed reducers (MAX / LATEST /
UNION / RANK_0_ONLY / IGNORE) + DOWN_ONLY whitelist for cross-rank DOWN-
flowing per-sample columns.
- components/global_monitoring.GuardContext — guard_training_context now
auto-registers the core states + invokes sync_step on first DDP entry.
- data/dataframe_manager.py — shm mirror of DOWN_ONLY columns visible to
DataLoader subprocess workers via fork; per-cell value-change gate so
rank-N's idempotent reconcile applies don't thrash worker resets; iter
invalidation triggers on real DOWN_ONLY mutations only.
- backend/dataloader_interface.py — WeightsLabDataSampler composes
DistributedSampler under DDP; sampler reads the shm mirror at yield time
(fork-safe); DataLoaderInterface gains _invalidate_iter to drop prefetched
stale batches on the trainer thread (avoids the std::terminate crash from
worker shutdown on a non-owning thread).
- trainer/services/experiment_service.py — RestoreCheckpoint passes
force=True (data snapshot was silently skipped when hashes appeared
equal) and re-pauses post-load (saved hp had is_training=True, which
load_state's register_hyperparams would otherwise re-apply).
- components/checkpoint_manager.py — three reset_iterator sites route
through the lazy _invalidate_iter path under DDP+workers.
- examples/PyTorch/ws-detection/src/main_ddp.py — spawn shim worker.
- examples/PyTorch/ws-detection/src/yolo_pipeline.py — extracted YOLO
pipeline (replaces the older ddp_smoke._build_pipeline / decode helpers).
- examples/PyTorch/ws-detection/src/ddp_test_suite.py — 8-scenario gRPC
integration suite: epoch_then_pause, discard_subset_freezes,
break_by_slice, lr_batch_propagate, checkpoint_data_roundtrip,
signal_coverage_all_graphs, resume_continues_curve, process_topology.
- tests/test_ddp_primitives.py — trivial 3-rank gloo verification of
reconcile_all (convergence + idempotency + change propagation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add 4 gap-coverage scenarios + collective-budget instrumentation
12 scenarios green end-to-end (8 original + 4 new). The new ones cover
gaps that were missing direct verification:
- scenario_multi_epoch_stability — 3 epochs back-to-back, asserts
(sid, model_age) entries are unique per graph (idempotent dedup at the
outbox merge) + age is strictly monotonic across epochs. Catches the
regression where outbox flushes would append rather than upsert per-step.
- scenario_empty_shard_starvation — discards ~95% of populated samples;
asserts the trainer does NOT silently hang at the next grad all-reduce
when one rank's shard ends up empty. Verifies loader cycle-and-skip
semantics under heavy filtering.
- scenario_seed_determinism — two consecutive break_by_slice pulls of the
per-sample loss history return byte-identical (sid, age, val) triples.
Catches stochasticity leaks in the read path that would silently break
the loss-shape descriptor downstream.
- scenario_collective_budget — programmatically asserts that every
training step uses EXACTLY 2 collectives (1 reconcile_all broadcast +
1 flush_outbox gather). Hard perf gate against future regressions that
add a stray dist.broadcast / dist.all_reduce in a hot path. Requires a
small SDK hook: WL_DDP_COLLECTIVE_LOG=<path> appends the prior step's
count to a file from inside reset_collectives() — opt-in, no overhead
when unset.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
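A rough sketch of how the opt-in hook and the budget assertion might fit together. Only `reset_collectives` and `WL_DDP_COLLECTIVE_LOG` come from the commit message; `note_collective`, `read_counts`, `demo`, and the one-integer-per-line file format are assumptions for illustration.

```python
import os
import tempfile

_collectives = 0


def note_collective():
    """Hypothetical hook: call once per collective (broadcast, gather, ...)."""
    global _collectives
    _collectives += 1


def reset_collectives():
    """Append the prior step's count to the log (if enabled), then reset."""
    global _collectives
    path = os.environ.get("WL_DDP_COLLECTIVE_LOG")
    if path:  # opt-in: nothing is written when the variable is unset
        with open(path, "a") as fh:
            fh.write(f"{_collectives}\n")
    _collectives = 0


def read_counts(path):
    with open(path) as fh:
        return [int(line) for line in fh if line.strip()]


def demo(steps=3):
    fd, path = tempfile.mkstemp(suffix=".log")
    os.close(fd)
    previous = os.environ.get("WL_DDP_COLLECTIVE_LOG")
    os.environ["WL_DDP_COLLECTIVE_LOG"] = path
    try:
        for _ in range(steps):
            note_collective()    # stands in for the reconcile_all broadcast
            note_collective()    # stands in for the flush_outbox gather
            reset_collectives()  # step boundary
        return read_counts(path)
    finally:
        if previous is None:
            os.environ.pop("WL_DDP_COLLECTIVE_LOG", None)
        else:
            os.environ["WL_DDP_COLLECTIVE_LOG"] = previous
        os.remove(path)


counts = demo()
budget_ok = all(count == 2 for count in counts)
```

A stray collective in a hot path would show up as a 3 in the log, which is what the scenario's exact-equality check is meant to catch.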
* Add scenario_curate_lifecycle — end-to-end UI curation flow
Tests the realistic multi-edit workflow under DDP:
epoch 1 → tag 3 suspects → discard them → epoch 2 → un-discard + tag
'verified' → epoch 3 → assert loss trajectory shows the gap.
Assertions:
[1] LIFECYCLE — for each suspect: pre-discard entries exist, NO entry
in the (discard_age, undiscard_age] window for any of them (this
is the proof that discard reached the workers' shm + sampler
fast-path), AND ≥1 suspect resumes post-undiscard.
[2] TAG COMPOSE — break_by_slice('verified') returns all 3 suspects
(proves multi-tag stacking on the same sample).
[3] PLOT METRICS — scalar plot has ≥3 epochs worth of points.
Side change: Client.discard now accepts discarded=False to un-discard via
the same EditDataSample RPC.
This brings the suite to 13 scenarios, all green at WL_DDP_BATCH=4,
WL_DDP_WORKERS=0, WL_DDP_WORLD_SIZE=2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Cleanup: drop unused primitives + document drop_last=False rationale
Removes ~155 LOC of dead-code surface that accumulated during the WIP push:
- weightslab/components/ddp_basic_building_blocks.py: drop aggregate_up
decorator, replicate_down decorator, reconcile_down (single-state hook),
plus the combine-helpers (_concat, combine_rank0, make_concat_combine)
that only existed to serve aggregate_up. The outbox/flush pattern
superseded ① aggregate_up for per-sample hot writes (one gather/step
instead of one per call); reconcile_all replaced reconcile_down (bundled
broadcast); replicate_down was never invoked. Zero external references
to any of them. Net -93 LOC.
- weightslab/utils/tools.py + utils/__init__.py: drop DistributedCounter
(CUT-tagged in the design notes, never adopted). Net -62 LOC.
- weightslab/backend/dataloader_interface.py: keep drop_last=False on the
DistributedSampler, but document why. Padded yields are real training
events that land in the loss trajectory as real (sid, model_age, value)
encounters with distinct ages from the sample's earlier yield; the
trajectory is encounter-keyed, not per-epoch-unique-keyed, so padding is
honest rather than pollution. drop_last=True was considered but rejected
as too simplistic: it would silently drop up to (world-1) trailing samples
each epoch and bias coverage downward.
Verified: scenario_discard_subset_freezes PASS at WL_DDP_BATCH=4,
WL_DDP_WORKERS=0, WL_DDP_WORLD_SIZE=2, WL_DDP_TEST_STEPS=20 — 156
populated samples, 5 discards held frozen across epoch 2, ~80% advance
on non-discarded.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
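The padding-versus-dropping trade-off is simple arithmetic. The sketch below follows the per-rank size rule of torch.utils.data.DistributedSampler as commonly documented (treat that as an assumption and check it against the installed version); the dataset size of 157 is an illustrative number, not a measurement from this PR.

```python
import math


def per_rank_samples(n, world, drop_last):
    """Samples each rank yields per epoch."""
    if drop_last and n % world != 0:
        # Tail is dropped so the remainder divides evenly.
        return math.ceil((n - world) / world)
    # Otherwise indices are padded (repeated) up to an even split.
    return math.ceil(n / world)


def coverage(n, world, drop_last):
    per_rank = per_rank_samples(n, world, drop_last)
    total = per_rank * world
    return {
        "per_rank": per_rank,
        "padded": max(total - n, 0),   # extra (repeated) yields per epoch
        "dropped": max(n - total, 0),  # samples never seen this epoch
    }


keep = coverage(157, 2, drop_last=False)
drop = coverage(157, 2, drop_last=True)
```

With drop_last=False the one extra yield is a real training encounter with its own model_age; with drop_last=True one sample per epoch is never trained on.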
* Rename ddp_planes → parallel_state, ddp_basic_building_blocks → parallel_primitives
Generalizes naming away from "DDP" since the primitives don't assume the
specific torch.distributed-DDP topology — they hold for FSDP / ZeRO / any
SPMD setup with a rendezvous-on-collective contract.
- weightslab/components/parallel_primitives.py (was ddp_basic_building_blocks)
- weightslab/components/parallel_state.py (was ddp_planes)
- docs/ddp_design.md (was components/ddp_design_notes.md)
Updated import sites (4): global_monitoring.py, dataframe_manager.py,
parallel_primitives.py (self-ref), tests/test_ddp_primitives.py.
Also: scenario_lr_batch_propagate threshold loosened from
`(expected + rank0_only) / 2` to `rank0_only + 1`. The old midpoint sat
right at the noise band; under drop_last=False with mid-iter batch-size
transitions the observed rate floats 13–14 samples/step, occasionally
tripping at exactly 13.75 < 14. New threshold cleanly distinguishes
"both ranks doubled" from "only rank-0 doubled" (rate ~12).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
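A worked version of the threshold change. The value expected = 16 is inferred here (two ranks at a doubled batch of 8 each, which is also what makes the old midpoint land on 14); the ~12 and 13.75 figures are taken from the message above. Whether the real test uses >= or > is an assumption.

```python
expected = 16.0     # assumed: both ranks doubled (2 ranks x 8 samples/step)
rank0_only = 12.0   # from the message: only rank-0 doubled, "rate ~12"
observed = 13.75    # from the message: a rate seen under drop_last=False

old_threshold = (expected + rank0_only) / 2  # midpoint, sits in the noise band
new_threshold = rank0_only + 1               # just above the failure case

old_passes = observed >= old_threshold
new_passes = observed >= new_threshold
failure_still_caught = not (rank0_only >= new_threshold)
```

The new threshold keeps a full sample/step of margin above the "only rank-0 doubled" case while leaving room for the 13 to 14 band observed in healthy runs.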
* Fix load_state to preserve model identity on same-arch restore
ledgers.register_model(new_model) replaced the registered object, orphaning
any captured reference (e.g. `model = trainer.model` in a training loop, or
DataLoaderInterface.self.model). Post-restore the trainer trained a stale
model while pause-checks read the fresh one, so pause_at_step never fired —
caught by scenario_resume_continues_curve in the DDP suite.
Skip register-replace + guard updates when existing model has same keys
AND shapes as the saved weights; let apply-weights load in-place. Add a
regression assertion in test_06 that captures the wrapped model identity
across load_state.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
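The identity-preserving restore described in the load_state commit above might look roughly like the following. It is a sketch only: `TinyModel`, `same_architecture`, `restore` and the dict-based registry are hypothetical stand-ins, not the ledger or model-wrapper API.

```python
def same_architecture(existing_shapes, saved_shapes):
    """True when both mappings have the same keys AND the same shapes."""
    return existing_shapes.keys() == saved_shapes.keys() and all(
        tuple(existing_shapes[k]) == tuple(saved_shapes[k]) for k in existing_shapes
    )


class TinyModel:
    """Hypothetical stand-in for a model: parameter name -> list of floats."""

    def __init__(self, weights):
        self.weights = weights

    def shapes(self):
        return {name: (len(values),) for name, values in self.weights.items()}

    def load_in_place(self, saved):
        # Overwrite the values without replacing the containers or the object.
        for name, values in saved.items():
            self.weights[name][:] = values


def restore(registry, saved_weights):
    """Load in place on a same-arch restore; replace the object otherwise."""
    existing = registry["model"]
    saved_shapes = {name: (len(v),) for name, v in saved_weights.items()}
    if same_architecture(existing.shapes(), saved_shapes):
        existing.load_in_place(saved_weights)
    else:
        registry["model"] = TinyModel({k: list(v) for k, v in saved_weights.items()})
    return registry["model"]
```

A reference captured before the restore (the `model = trainer.model` case) keeps pointing at the object that receives the weights, which is the property the regression assertion checks.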
* Rewrite docs/ddp_design.md as a concise design overview
Cut 159 → 66 lines (~58% smaller, 1426 → 588 words). Drops decision-tag
scratchpad ([DECIDED]/[OPEN]/[DEFER]/[CUT]), per-state placement debate,
wrapper-prologue code sketches, and the open-questions section — all
process residue. Keeps:
- Two-space framing (train-space vs sdk-space, kernel/user analogy).
- SPMD-with-one-privileged-rank constraint.
- Two-kinds-of-sync framing (grad reduction = off-the-shelf; async UI = WL's job).
- The loop-iteration-as-transaction insight and the train→sdk transition
as the consistency boundary.
- State × direction table.
- DOWN broadcast / UP outbox / shm mirror mechanisms with API entry points
(register_consistent_state / register_outbox + the anchor functions).
- Collective budget (~2 rendezvous/step) + instrumentation env vars.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Outbox ships per-step deltas, not full snapshots
local_df_writes / local_signal_triples emitted the WHOLE dataframe and the
WHOLE per-sample signal history every step, gathered to rank-0 each step. The
~2-collectives/step budget bounds the COUNT of rendezvous but not their bytes,
so payload scaled with N_samples x world (df) and grew unboundedly (signals) —
the real scaling wall.
Each rank now dumps only what changed since its last flush: changed dataframe
rows (vs a process-local _LAST_SENT_DF signature cache) and signal triples past
a per-(graph, exp_hash) cursor read straight off the append-only buffers. On
respawn/restore the cache resets to a one-time full resend, safe because every
merge is idempotent.
merge_df_writes seeds rank-0's current values first (existing-first) before the
per-column reducer, so a delta that omits a sample rank-0 already holds a higher
value for cannot regress MAX/UNION, while LATEST still resolves to the newest
delta. clear_registry resets the caches.
Docs + outbox comment updated to describe the delta transport and clarify the
budget governs collective count, not payload; shm section corrected to note only
the bool deny-list (`discarded`) is read at __getitem__, not user_tags.
Validated: scenario_signal_coverage_all_graphs PASS (per-sample 940/940 across
both ranks; test_ddp_primitives 3-rank reconcile PASS).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
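The delta transport from the outbox commit above can be sketched in plain Python. The names `_LAST_SENT`, `_CURSORS`, `df_delta`, `signal_delta` and `reset_caches` are illustrative, and rows are modelled as plain dicts rather than a dataframe; the real signature cache and buffers may differ.

```python
_LAST_SENT = {}  # sample_id -> signature of the row last shipped
_CURSORS = {}    # (graph, exp_hash) -> number of triples already shipped


def df_delta(rows):
    """rows: sample_id -> {column: value}. Return only rows changed since last flush."""
    changed = {}
    for sample_id, row in rows.items():
        signature = tuple(sorted(row.items()))
        if _LAST_SENT.get(sample_id) != signature:
            changed[sample_id] = dict(row)
            _LAST_SENT[sample_id] = signature
    return changed


def signal_delta(key, buffer):
    """Return the entries appended to an append-only buffer since the last call."""
    start = _CURSORS.get(key, 0)
    if start > len(buffer):
        # The buffer shrank under the cursor (e.g. after a restore): resend from 0.
        start = 0
    _CURSORS[key] = len(buffer)
    return buffer[start:]


def reset_caches():
    """After a respawn/restore the next flush is a one-time full resend."""
    _LAST_SENT.clear()
    _CURSORS.clear()
```

The full resend after `reset_caches()` is only safe because, as the commit notes, every merge on rank-0 is idempotent.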
* Align DDP module docstrings with ddp_design (remove stale primitives)
The parallel_primitives module docstring advertised three decorator-form
primitives — AGGREGATE_UP, REPLICATE_DOWN — and a reconcile_down single-state
hook, none of which exist in the code (grep finds them only in the docstring).
The implemented + design-doc'd surface is two mechanisms: register_consistent_
state/reconcile_all (DOWN) and register_outbox/flush_outbox (UP), plus the shm
deny-list mirror and the sync_step anchor. Rewrote the docstring to match that
and dropped the ①/②/③ numbering everywhere (parallel_state plane table and the
inline anchor comments now use plain DOWN/UP, matching docs/ddp_design.md).
Comment/docstring-only; no logic change.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Add fast unit tests for the outbox delta optimization
Proves the delta change's correctness in ~30ms instead of via the multi-minute
YOLO scenarios: df_writes emits only changed rows; merge_df_writes seeds rank-0's
current value so a stale/lower delta can't regress a MAX column (last_seen) while
a higher one still raises it and LATEST picks the newest delta; local_signal_
triples advances a per-(graph, exp_hash) cursor and resends from 0 if the buffer
shrank under it (restore safety). No DDP spawn — ledger getters are monkeypatched.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
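The existing-first merge that these unit tests exercise can be illustrated as follows. `merge_column` is a hypothetical per-column helper over plain dicts, not the actual merge_df_writes signature.

```python
def merge_column(existing, deltas, policy):
    """Merge one column.

    existing: sample_id -> value rank-0 already holds.
    deltas:   list of (sample_id, value) pairs in arrival order.
    policy:   "MAX" or "LATEST".
    """
    merged = dict(existing)  # seed rank-0's current values first
    for sample_id, value in deltas:
        if value is None:
            continue  # a missing value never overwrites anything
        if policy == "MAX":
            current = merged.get(sample_id)
            merged[sample_id] = value if current is None else max(current, value)
        elif policy == "LATEST":
            merged[sample_id] = value
        else:
            raise ValueError(f"unknown policy: {policy}")
    return merged
```

Seeding with the existing values is what stops a stale or lower delta from regressing a MAX column such as last_seen, while LATEST still resolves to the newest delta.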
* Document DDP sampler sharding is training-only (eval is latent-unsupported)
The sampler shards EVERY loader, but the per-step anchor runs only in training
context — so a sharded eval under DDP would have each rank score 1/world with no
scalar-metric aggregation (undercount). No eval runs under DDP today, so this is
latent; added a TODO(ddp-eval) guard at _get_dist_sampler so nobody adds a DDP
eval loop without first resolving the eval sharding/aggregation policy.
Comment-only.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Cap shm deny-list index so large/sparse uids don't blow allocation
The DOWN-only shm vec is indexed directly by int(sample_id), so a single large
sparse uid (e.g. an inode-based id ~1e8) allocated a ~100MB bool array. Cap the
index at 1<<22 (~4M → ~4MB max per origin/col): ids at/above the cap skip the
shm fast-path and fall back to the sampler's pandas deny-list check, which is
the actual read site (the shm is read in the main-process sampler, not workers —
docstring corrected to say so). Dense 0..N id schemes keep the fast-path. Warns
once per origin when the cap is hit.
Adds test_ddp_shm_cap: huge id neither allocates a giant array nor breaks the
small-id fast-path; undiscard clears the cell.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Restrict shm mirror to boolean deny-list columns (skip user_tags)
_propagate_to_shm ran bool(val) over every DOWN_ONLY column into the bool shm
vec, so user_tags (a list column) stored a meaningless bit that nothing reads
(the sampler only queries 'discarded'). Filter the mirror to genuinely boolean
columns via _shm_bool_eligible: bool dtype, or an object column whose non-null
cells are all bool / 0-1 scalars. user_tags still reconciles to children via the
DOWN broadcast; it just no longer gets a bogus shm array.
Test: a user_tags list column allocates no shm array while discarded still does.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Decouple DDP shard reshuffle from iterator reset (epoch → reshuffle_seq)
_generate_indices auto-advanced the DDP epoch on every fresh iterator, so a
mid-loop discard/tag — which forces an iterator reset — reshuffled the entire
per-rank shard. In a curation workflow (discard-heavy by design) the shard order
churned on every discard, and "epoch" counted iterator resets, not dataset
passes — a meaningless abstraction here.
Rename _epoch → _reshuffle_seq (it's a reshuffle generation, not a pass count)
and stop auto-advancing it. The reshuffle generation now advances ONLY on a
genuine pass-end reset (loader calls sampler.advance_reshuffle() on the
_epoch_exhausted path); the discard-invalidation reset path leaves it untouched,
so a discard re-filters the SAME permutation instead of reshuffling.
Reproducibility across resets is preserved but composed correctly: the per-rank
permutation is a pure fn of (ddp_seed, reshuffle_seq, rank, world), so
capture/restore_iteration_state now save/restore reshuffle_seq + seed; combined
with samples_yielded (offset) and the deny-list (checkpointed as a DOWN_ONLY df
column) this reproduces the exact filtered stream. Warns on a seed mismatch.
Side effect: also fixes the __len__-vs-iteration epoch off-by-one — neither
_generate_indices nor _rank_indices_snapshot advances during iteration now, so
they read the same generation.
Adds test_ddp_reshuffle_seq: re-gen without advance is stable; advance reshuffles;
restore reproduces; ranks partition disjoint+cover; seed mismatch warns.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
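The "pure fn of (ddp_seed, reshuffle_seq, rank, world)" property from the reshuffle commit above can be sketched with the standard library. `rank_indices` is a hypothetical helper, and the way the seed and generation are mixed is an assumption; the sampler's real mixing is not shown in the commit.

```python
import random


def rank_indices(n, ddp_seed, reshuffle_seq, rank, world):
    """Per-rank shard as a pure function of its arguments (no hidden counters)."""
    # Assumed mixing of seed and reshuffle generation into one RNG seed.
    rng = random.Random(ddp_seed * 1_000_003 + reshuffle_seq)
    permutation = list(range(n))
    rng.shuffle(permutation)
    return permutation[rank::world]  # stripe the shared permutation across ranks
```

Because nothing advances during iteration, regenerating the indices after a discard-triggered iterator reset yields the same permutation, and only an explicit bump of `reshuffle_seq` at a genuine pass end changes it.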
* Update README.md
* Guard user callbacks at collective boundaries so one can't hang the group
reconcile_all built its snapshot ({name: snap()}) and applied children states
unguarded, and flush_outbox built its payload ({name: dump()}) unguarded. Under
SPMD a callback that raises on one rank but not others crashes that rank BEFORE
(snapshot/dump) or AFTER (apply) a collective, leaving every other rank blocked
forever on the broadcast/gather.
Wrap each snap()/apply()/dump() in try/except (merge() already was), so the
collective itself is ALWAYS reached: a failed state/channel ships as None
(apply/merge already tolerate None) and the rest sync normally. Matches the DDP
module's existing swallowed-exception style (logger.debug("[tag] ... failed: %s",
exc)). Also switch parallel_primitives' logger from getLogger("weightslab.ddp")
to getLogger(__name__) to match the rest of the codebase.
Adds test_ddp_collective_resilience: a "bad" state (snapshot raises on rank 0,
apply raises on children) + a "bad" outbox (dump/merge raise) run two anchor
rounds + a barrier without hanging, and the healthy state still syncs on all
ranks. Original test_ddp_primitives still passes (no reconcile regression).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
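The guarding pattern from the collective-boundary commit above, reduced to its core. `guarded` and `build_payload` are illustrative names; the actual collective (broadcast/gather) is left out, since the point is only that the payload is always built and the collective is therefore always reached.

```python
import logging

logger = logging.getLogger(__name__)


def guarded(tag, callback, *args):
    """Run a user callback; on failure log at debug level and ship None instead."""
    try:
        return callback(*args)
    except Exception as exc:
        logger.debug("[%s] callback failed: %s", tag, exc)
        return None


def build_payload(callbacks):
    """Build {name: value} for every registered callback, even if some raise."""
    return {name: guarded(name, cb) for name, cb in callbacks.items()}
```

A failing state or channel then travels as None, which the apply/merge side already tolerates according to the commit, and the healthy entries sync normally.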
* Add world>2 uneven-shard coverage to the reshuffle test
n=25, world=3 exercises DistributedSampler's drop_last=False padding: each shard
pads to length 9 and the union still covers the whole universe. Complements the
even world=2 disjoint+cover case.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
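The padding arithmetic for the n=25, world=3 case can be reproduced without torch. This sketch mirrors the wrap-around padding and striding that DistributedSampler applies with drop_last=False, with shuffling left out; `padded_shards` is a hypothetical helper.

```python
import math


def padded_shards(n, world):
    """Split range(n) into `world` equal-length shards, padding by wrap-around."""
    per_rank = math.ceil(n / world)  # 25 / 3 -> 9
    total = per_rank * world         # 27
    indices = list(range(n))
    indices += indices[: total - n]  # repeat the first (total - n) indices
    return [indices[rank:total:world] for rank in range(world)]
```

For n=25 and world=3 two indices are repeated, every shard has length 9, and the union still covers all 25 samples.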
* Remove the redundant shm deny-list mirror
The shared-memory DOWN_ONLY mirror was never load-bearing: the deny-list is
enforced sampler-side (a discarded sample is never yielded), the sampler's pandas
cache already refreshes on the deny-list revision bump (so a live discard is
reflected within one index), and a sample already in a worker's prefetch queue is
dropped by iterator invalidation tearing the workers down — NOT by shm (shm
filtered at yield time, before the discard existed, so it never had power over
queued samples). No test depended on it; nothing read it except the main-process
sampler.
Remove _propagate_to_shm / _ensure_shm_capacity / _shm_bool_eligible /
is_in_down_only_shm + the shm fields and the sampler's shm fast-path (~190 LOC,
plus the ctypes/multiprocessing/os imports). Replace the invalidation gate — which
must stay gated on an ACTUAL value change so rank-1+ don't respawn workers every
step under DDP — with a pandas before/after diff (_down_only_changed) computed
before the upsert merges. Update docs (design + comments) to describe sampler-side
enforcement + invalidation, dropping the inaccurate "workers fork-read the shm"
story. Supersedes the shm-cap (task #3) and user_tags-shm (task #4) fixes.
Drops test_ddp_shm_cap; adds test_ddp_down_only_change covering the gate,
including the DDP no-respawn invariant (re-applying the same snapshot → no change).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
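The before/after gate that replaces the shm mirror could be sketched like this. It uses plain dicts instead of pandas, `down_only_changed` is an illustrative name, the column set is reduced to `discarded`, and treating a missing flag as False is an assumption.

```python
DOWN_ONLY = ("discarded",)


def down_only_changed(before, incoming):
    """True only if an incoming row changes a DOWN_ONLY value.

    before:   sample_id -> row dict currently stored.
    incoming: sample_id -> partial row dict about to be upserted.
    Must be called BEFORE the upsert merges `incoming` into `before`.
    """
    for sample_id, new_row in incoming.items():
        old_row = before.get(sample_id, {})
        for column in DOWN_ONLY:
            # Assumption: a missing flag counts as False (not discarded).
            if column in new_row and new_row[column] != old_row.get(column, False):
                return True
    return False
```

Re-applying an identical snapshot returns False, which is the no-respawn invariant: rank-1+ do not tear down workers on every step under DDP.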
* DDP: rebalance-on-discard sharding, drop per-flush image decode, anchor/delta cleanups
- Sampler: replace shard-then-filter with a filter->pad->stripe REBALANCE so live
shards are always equal length across ranks — fixes the empty-shard deadlock
(scenario_empty_shard_starvation) by construction; order-preserving + deterministic
(not a reshuffle). persistent_workers=True makes a discard/undiscard rebuild a
cheap reset (reuse workers, drop stale prefetch) instead of a fork+reinit.
drop_last=False under DDP keeps a tiny live set training so age still advances.
- dataframe_manager: stop the storage-time bbox->seg get_mask conversion — it
re-decoded a full image per signal flush (~13% of rank-0 wall) and silently
corrupted detection prediction_raw; mask rendering stays available on-demand via
get_prediction_mask. Remove the now-dead normalize/_is_array_column/_get_loader.
- Anchor split DOWN(__enter__)/UP(__exit__); outbox ships per-rank deltas via a
writer dirty-set; remove post-hoc active-sample masking from the model wrapper.
- Test suite: compute epoch_steps from WL_DDP_BATCH (was config's mono batch=4 while
the loader trained at 16 -> every "epoch" silently covered ~4 passes); add
scenario_progressive_resample (shrink->grow) + per-phase timing + a WL_DDP_SELFSPY
self-profiling hook.
- docs/ddp_design.md: document rebalance-not-reshuffle + persistent-worker reset.
Note: 2 coverage assertions (epoch_then_pause populated~=shard, progressive_resample
advance%) still assume the old over-training and read the per-sample gather before it
fully settles for a now-correct single epoch; recalibration deferred.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
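The filter->pad->stripe rebalance from the sampler bullet above, as a minimal sketch. `rebalance` is a hypothetical function; cyclic padding is an assumption about how the pad step fills the gap when fewer live samples than ranks remain.

```python
def rebalance(order, denied, rank, world):
    """Return this rank's shard of the live samples, equal length on every rank."""
    # Filter: drop discarded ids, keeping the existing order (not a reshuffle).
    live = [i for i in order if i not in denied]
    if not live:
        return []
    # Pad: repeat live ids cyclically until the length is a multiple of world.
    pad = (-len(live)) % world
    padded = live + [live[i % len(live)] for i in range(pad)]
    # Stripe: every rank takes each world-th element, so lengths always match.
    return padded[rank::world]
```

With ten samples, three discards and two ranks, each rank gets four ids. With a single live sample and three ranks, every rank still gets one id, so no rank ends up with an empty shard.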
* DDP: make WL_DDP_BATCH effective, event-based pause (no sleep), resumable suite
- yolo_pipeline: push WL_DDP_BATCH into the in-memory cfg, not just the DataLoader
ctor. _sync_batch_size_from_ledger re-applies the ledger's batch every iteration,
so without this the loader silently reverted to config's mono batch=4 and the env
was dead — the suite trained at 4 while epoch_steps assumed 16 (¼-epoch coverage,
which failed epoch_then_pause / progressive_resample after the epoch-math change).
Now img=16 for real: full single-epoch coverage + genuine ~23% speedup (4× fewer
steps, same work). config.yaml on disk untouched (mono unaffected).
- parallel_primitives / global_monitoring: kill the 20ms busy-sleep pause-spin in
sync_step. Rank 0 blocks on the pause_controller resume Event (new wait_for_resume);
rank-1+ block inside the next reconcile_all broadcast. Neither spins (gloo socket-
waits; NCCL would busy-spin — noted). Bounded timeout kept for SIGINT/SIGTERM
responsiveness, not polling.
- ddp_test_suite: WL_DDP_SKIP (comma-sep substrings) so a killed run resumes by
skipping already-passed scenarios.
Full 14-scenario suite green at batch-16 (incl. empty_shard_starvation,
progressive_resample, and the event-based pause via epoch_then_pause).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
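The event-based pause from the parallel_primitives / global_monitoring bullet above can be sketched with `threading.Event`. `PauseController` here is a hypothetical stand-in for pause_controller, and the `should_stop` hook is an assumed way to model SIGINT/SIGTERM responsiveness.

```python
import threading


class PauseController:
    """Hypothetical stand-in: starts paused, resumes when the Event is set."""

    def __init__(self):
        self._resumed = threading.Event()  # unset == paused

    def pause(self):
        self._resumed.clear()

    def resume(self):
        self._resumed.set()

    def wait_for_resume(self, timeout=0.5, should_stop=lambda: False):
        """Block until resumed. Returns False if asked to stop while paused.

        The bounded timeout only exists so a stop request is noticed; the
        thread sleeps inside Event.wait rather than in a sleep loop.
        """
        while not self._resumed.wait(timeout=timeout):
            if should_stop():
                return False
        return True
```

Rank 0 would call `wait_for_resume()` at the anchor, while the other ranks simply block in the next broadcast, as the commit describes.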
* DDP: trim verbose comments (no behavior change)
Condense the "novel"-length comments added during the DDP work to 1-2 tight
sentences each — worst offender was dataloader_interface (rebalance/reshuffle,
__iter__, __len__, persistent_workers, _reset_iterator). Key invariants kept;
redundant restatements dropped. ~80 fewer comment lines. Code unchanged
(reshuffle unit test still green).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* DDP: delta the DOWN reconcile + vectorize the UP merge (anchor 2x faster, no O(N)/step)
Two hidden O(N)-per-step costs in the cross-rank anchor, both removed:
- DOWN reconcile (rank0_df_down_state) rebuilt+pickled+broadcast {col:{sid:val}} for
ALL non-null DOWN_ONLY cells every step (discarded defaults False=non-null, so all
touched samples rode) — O(N). Now a DELTA: rank-0 ships only the sample-ids whose
DOWN_ONLY changed since the last reconcile (drain_down_delta dirty-set, populated in
upsert_df on a real DOWN diff), with one full snapshot on first reconcile / post-
restore (mark_down_full_resend, hooked in _load_existing_data) so children converge.
N-sweep: full build+pickle was 1.5ms@1k / 119ms@100k / 619ms@500k; delta ~0 when
unchanged, O(discards) otherwise.
- DOWN_ONLY trimmed to {"discarded"} — "user_tags" was never a real column (it's
"tag"), tags are rank-0 UI state (tag->label override is vestigial), and tag queries
gather signals UP + filter on rank-0, so nothing tag-shaped needs to reach a sampler.
- UP merge (merge_df_writes): replaced per-column groupby.apply(python reducer) with one
vectorized groupby.agg({col:'max'/'last'}) — _r_max/_r_latest map exactly to skipna
max/last; policy_for only yields MAX/LATEST here (UNION is tags, DOWN-filtered). And
_rank0_existing_seed stopped copying the WHOLE dataframe every flush — it now indexes
just the delta's ~batch rows.
Anchor 168 -> 153 (DOWN delta) -> 78.6 ms/step (merge), ~2x. Validated:
discard_subset_freezes, progressive_resample, break_by_slice, curate_lifecycle,
signal_coverage all PASS.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
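The DOWN delta from the first bullet of this commit can be pictured as a dirty-set with a full-resend flag. `DownDelta` and its `mark_changed`/`drain` methods are illustrative; only the idea behind drain_down_delta and mark_down_full_resend is taken from the commit, not their real signatures.

```python
class DownDelta:
    """Tracks which sample ids changed their DOWN_ONLY value since the last reconcile."""

    def __init__(self):
        self._dirty = set()
        self._full_resend = True  # the first reconcile ships a full snapshot

    def mark_changed(self, sample_id):
        # Called from the upsert path only on a real DOWN_ONLY diff.
        self._dirty.add(sample_id)

    def mark_down_full_resend(self):
        # Called after a restore so children converge from a full snapshot.
        self._full_resend = True

    def drain(self, table):
        """table: sample_id -> discarded flag. Return this reconcile's payload."""
        if self._full_resend:
            payload = dict(table)
        else:
            payload = {sid: table[sid] for sid in self._dirty if sid in table}
        self._dirty.clear()
        self._full_resend = False
        return payload
```

When nothing changed the payload is empty, so the per-step cost scales with the number of discards rather than with the number of samples.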
* Add ddp_ablation.py — 3-mode WL SDK overhead harness (time/mem/IO/bytes per rank)
WL_ABLATE=ul|wlimport|wl on 2 gloo ranks, WL_ABLATE_STEPS configurable. Per mode:
per-section ms/step, grad bytes/step, per-rank RSS + /proc/self/io (rchar/wchar =
syscall bytes incl. gloo sockets; read_bytes/write_bytes = actual disk), and the WL
df RAM + H5 store sizes + H5 flush config.
256-step result: WL time tax +247ms/+17.6%. The biggest contributor is
criterions+log at +148ms (save_signals + NMS decode-for-logging), followed by
anchor +89ms and loader/wrapper +28ms. RAM: +108MB import-idle + ~40MB active;
df/H5 tiny. The I/O counters surfaced that rank-1 redundantly persists ~6MB to
H5 (should be outbox-only, since rank-0 is authoritative).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Relocate DDP integration suite + perf/ablation into tests/
PR #185 review: tests don't belong in the usecase dir. Move ddp_test_suite.py,
ddp_ablation.py, aggregate_wl_ownership.py and the report driver out of
examples/PyTorch/ws-detection/src into tests/integrations/ultralytics/ddp/.
One god-script (run_ddp_report.sh) with modes info/scenarios/ablation/profile
emits a single report: perf counters, per-scenario times, and the wl-ulmanual
ablation delta. Scripts path-bootstrap back to the usecase src so
yolo_pipeline / utils.* / config.yaml resolve. Added README + .gitignore.
Locally-run (needs GPU + dataset) — not a CI unit test.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* PR #185: all_reduce_scalar(avg) + mono/DDP usecase README
- utils.tools: add all_reduce_scalar(value, reduction="sum"|"avg"); avg = sum/world
since gloo has no ReduceOp.AVG. Keep all_reduce_sum_scalar as a back-compat wrapper.
- examples/.../ws-detection/README.md: document the mono (main.py) and DDP (main_ddp.py)
usecases, how to run, and the single-GPU gloo DDP simulation.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
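The avg = sum/world rule from the utils.tools bullet above, sketched without a process group. The `sum_across_ranks` and `world_size` parameters are assumptions added so the example runs standalone; the real helper presumably calls the distributed all-reduce directly.

```python
def all_reduce_scalar(value, reduction="sum", *, sum_across_ranks, world_size):
    """Reduce a scalar across ranks using only a SUM collective.

    sum_across_ranks: callable standing in for an all-reduce with SUM.
    world_size:       number of participating ranks.
    """
    if reduction not in ("sum", "avg"):
        raise ValueError(f"unsupported reduction: {reduction}")
    total = sum_across_ranks(float(value))
    # No AVG op is assumed to be available, so average = sum / world.
    return total / world_size if reduction == "avg" else total


def all_reduce_sum_scalar(value, **kwargs):
    """Back-compat wrapper that always sums."""
    return all_reduce_scalar(value, reduction="sum", **kwargs)
```

In the test below a lambda fakes a second rank that contributes 5.0, so a local value of 3.0 sums to 8.0 and averages to 4.0.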
* PR #185: exclude is_training/pause_at_step/root_log_dir from the saved HP snapshot
get_HP_snapshot dumped the whole hp dict, so a restore's register_hyperparams(
saved_config) resurrected experiment STATE (notably is_training=True) — the bug the
post-restore force in experiment_service worked around. Strip the same state-only keys
already excluded from the experiment hash, on a copy (never mutate the ledger). The
post-restore pause stays, now as the intentional "user drives the next cycle", not a
workaround.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
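Stripping the state-only keys on a copy, as the HP-snapshot commit above describes, might look like this. `get_hp_snapshot` and `STATE_ONLY_KEYS` are illustrative names over a plain dict; the key list is the one named in the commit title.

```python
import copy

STATE_ONLY_KEYS = ("is_training", "pause_at_step", "root_log_dir")


def get_hp_snapshot(hyperparams):
    """Return a snapshot without experiment-state keys; never mutate the input."""
    snapshot = copy.deepcopy(hyperparams)
    for key in STATE_ONLY_KEYS:
        snapshot.pop(key, None)
    return snapshot
```

Restoring from such a snapshot can no longer resurrect is_training=True, because the key is simply absent from what was saved.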
* Fix ruff F841: drop unused group_ids / pre_restore_max_plot_age / ctx
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---------
Co-authored-by: Alexandru Rotaru <rotarualexandruandrei94@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…files, heading levels
- PRs now formatted as [#N](url) title — date with author GitHub profile links
- Contributors section links to github.com/{login} (from PR authors)
- Commits capped at 25 most recent non-merge commits
- Title: ## **Weightslab** (no version, ## level)
- Sections demoted to ### level
- Removed separator between title and LinkedIn/Graybx links
- Dev release routes PRs from --base dev, main from --base main
- Doc build gated on main-branch check (not just tag pattern)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…, bb_B], batch_idx=[batch_idA, B, ..etc]
* Add to detection usecase dump history and custom signal labelling
* upgrade documentation with new functions and examples
* add and fix utests
* refactor the logger and add instances history and queries functions, with a user wl.write_history function
* add df writing for user during exp
* fix code quality issues
GetDataSamples with an empty stats_to_retrieve fell through to the slow per-sample path that serializes every column. For dense tasks (e.g. 3D detection) pred/target are large JSON arrays (~310 KB/record), bloating the response to 100s of MB and silently breaking the histogram fetch.
Route empty-stats requests (no image payload, no resize) through the fast vectorized metadata path, defaulting to all columns except heavy blob columns (pred/target/prediction_raw).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
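The default column choice for empty-stats requests can be sketched as a small selection function. `columns_to_fetch` and `HEAVY_BLOB_COLUMNS` are hypothetical names; only the three excluded column names come from the commit message.

```python
HEAVY_BLOB_COLUMNS = frozenset({"pred", "target", "prediction_raw"})


def columns_to_fetch(all_columns, stats_to_retrieve):
    """Pick the columns to serialize for a request.

    An empty stats_to_retrieve means "all metadata", which here excludes the
    heavy blob columns; an explicit list is honoured as given.
    """
    if stats_to_retrieve:
        wanted = set(stats_to_retrieve)
        return [c for c in all_columns if c in wanted]
    return [c for c in all_columns if c not in HEAVY_BLOB_COLUMNS]
```

A caller that really needs pred or target can still ask for them by name, so the default only changes what an empty request returns.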
…tegration issues with evaluation mode
…ds also; and remove dump model architecture for now
pandas >= 2.1 warns (and will error in a future version) when a partial .loc assignment changes a column's dtype - e.g. assigning object data (numpy arrays / stringified masks) or bool flags into a column that was initialized as float64 via np.nan.
- h5_dataframe_store.upsert: add _align_col_dtype_for_assign() to widen the target column to object before partial merges (3 assignment sites).
- dataframe_manager._merge_overwrite: widen target to object before assigning object/bool values into a numeric column.
- dataframe_manager._apply_updates_frame_locked: keep the up-front categorical widen (uncatchable AssertionError otherwise) and make the masked block assignment promote the warning + widen-to-object and retry, covering array-into-float32 signal writes.
Compatible numeric assignments keep their dtype (fast path). No production source emits the warning under -W error; data + ledger suites pass (153 tests).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>