v1.2.6 by guillaume-byte · Pull Request #208 · GrayboxTech/weightslab

guillaume-byte · 2026-06-18T16:08:31Z

Fix ledgered parameters issues with ValueProxy
Fix the bounding boxes normalization issue in the prediction column

…ration
pause_controller starts paused by default and is intended to be driven by the UI / training loop, not the data path. Calling _wait_if_paused() from __iter__ and __next__ meant any script that iterated the loader before an external resume would hang forever, even at num_workers=0. Workers made the failure mode look like a worker bug (leaked semaphores, freezes under load), but the same hang reproduced with no workers at all.
Pause is a training-loop concern; the loader should just deliver bytes. _wait_if_paused() itself is preserved so training loops can still call it explicitly at safe points (between optimizer steps).
Verified:
- weightslab/tests/backend/test_data_loader_interface.py: 9/9 pass (incl. test_dataloader_interface_uses_multiple_workers, test_multiple_workers_parallelize_preprocessing)
- Wider tests/backend + tests/components sweep: 108 pass, 2 unrelated pre-existing failures in test_ui_docker_bridge (cert script + Windows path test on Linux)
- ws-detection example with num_workers={0,2,4} on CPU: clean runs, W=2 ~21% faster than W=0, no hangs or crashes
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
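The split described above (the loader only delivers data, the training loop decides when to wait) can be illustrated with a minimal sketch. `PauseController`, `Loader`, and `train` are hypothetical stand-ins built on `threading.Event`, not the weightslab classes; the timeout parameter is an assumption added so the example cannot hang.

```python
import threading


class PauseController:
    """Hypothetical stand-in for the pause controller; it starts paused."""

    def __init__(self):
        self._resumed = threading.Event()  # cleared means "paused"

    def pause(self):
        self._resumed.clear()

    def resume(self):
        self._resumed.set()

    def is_paused(self):
        return not self._resumed.is_set()

    def wait_if_paused(self, timeout=None):
        # Returns True once resumed, False if the timeout expired first.
        return self._resumed.wait(timeout)


class Loader:
    """Data path only: iteration never consults the pause controller."""

    def __init__(self, items):
        self._items = list(items)

    def __iter__(self):
        return iter(self._items)


def train(loader, controller, step_fn, wait_timeout=None):
    """The training loop owns the pause check, at a safe point between steps."""
    steps = 0
    for batch in loader:
        if not controller.wait_if_paused(wait_timeout):
            break  # still paused after the timeout; stop cleanly
        step_fn(batch)
        steps += 1
    return steps
```

With this shape, a script that only iterates `Loader` completes even while the controller is paused, and only `train` blocks.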

a4cf489 removed DataLoaderInterface._wait_if_paused's call sites to fix the num_workers>0 startup deadlock, but that was also the only place enforcing the explicit pause_at_step hyperparam. Re-add the trigger in GuardContext.__enter__ (training only), before the architecture lock so the pause blocks lock-free. pause() zeroes pause_at_step, so it fires once. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
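As a rough illustration of the one-shot trigger described in this commit, the sketch below checks a `pause_at_step` value on context entry, before taking a lock, and zeroes it when the pause fires. `Hyperparams` and `StepGuard` are hypothetical names, and the sketch records the pause instead of blocking; the real GuardContext may differ in all of these details.

```python
import threading


class Hyperparams:
    def __init__(self, pause_at_step=0):
        self.pause_at_step = pause_at_step  # 0 means "no scheduled pause"


class StepGuard:
    """Hypothetical guard modelled loosely on the GuardContext description."""

    def __init__(self, hp, lock, training=True):
        self.hp = hp
        self.lock = lock
        self.training = training
        self.step = 0
        self.pause_events = []  # steps at which a pause was triggered

    def pause(self):
        self.pause_events.append(self.step)
        self.hp.pause_at_step = 0  # zeroing makes the trigger fire only once

    def __enter__(self):
        # Check before acquiring the lock so a pause never holds it.
        target = self.hp.pause_at_step
        if self.training and target and self.step >= target:
            self.pause()
        self.lock.acquire()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.lock.release()
        self.step += 1
        return False
```

Running five guarded steps with `pause_at_step=2` triggers exactly one pause, at step 2; a guard created with `training=False` never triggers.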

Its call sites were removed in a4cf489 (deadlock fix) and its pause_at_step trigger was relocated to GuardContext.__enter__ in the previous commit, so the method is now unused. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- main.py: refactor train loop to the infinite-generator form (re-shuffles each epoch) and default both loaders to num_workers=2 (GPU sweep: ~+76% throughput vs workers=0, sweet spot; W=4 regresses).
- bench.py + run_bench.sh: configurable num_workers/epochs/wall-time harness with WL_BENCH_NO_VAL for clean throughput runs; forces the model onto the target device after watch_or_edit (see device note below).
- config.yaml: local run tweaks.
Note: watch_or_edit(flag="model", device=...) currently drops its device= kwarg (src.py model branch returns the proxy without honoring it), so the bench applies an explicit .to(device) workaround; the framework path is still unfixed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…terface Both were only referenced by _wait_if_paused, removed in 41b76b8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

These were local correctness-check tooling for the num_workers sweep, not part of the example. Kept on disk locally (untracked), removed from version control. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

main.py (runnable train-loop fix) moves to a dedicated fix branch off dev; config.yaml carried machine-local tweaks. Both are kept locally but removed from this branch so it contains only the parallelism/pause framework changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

An earlier merge left main.py's train loop in a non-runnable state. Restore a working loop (infinite-generator batching that re-shuffles each epoch, per-sample loss/IoU via the criterion dict) and default the loaders to num_workers=2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
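For readers unfamiliar with the "infinite-generator" loop form mentioned here, a minimal pure-Python sketch follows. `infinite_batches` is a hypothetical helper, not the code in main.py; it only shows batches being yielded forever with the sample order reshuffled at each epoch boundary.

```python
import random
from itertools import islice


def infinite_batches(dataset, batch_size, seed=0):
    """Yield (epoch, batch) forever, reshuffling the sample order each epoch."""
    if len(dataset) == 0:
        raise ValueError("dataset must not be empty")
    rng = random.Random(seed)
    indices = list(range(len(dataset)))
    epoch = 0
    while True:
        rng.shuffle(indices)  # new order for every pass over the data
        for start in range(0, len(indices), batch_size):
            chunk = indices[start:start + batch_size]
            yield epoch, [dataset[i] for i in chunk]
        epoch += 1


# The caller bounds the loop by step count instead of by epoch.
first_four = list(islice(infinite_batches(list("abcdef"), 4), 4))
```

Driving the loop by step count is what lets a pause or a scheduled stop land at any step without special handling at epoch boundaries.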

Fix ws-detection example: restore runnable training loop

…llel+distributed+plightning+networkfs Framework intergrations/parallel+distributed+plightning+networkfs

* fix oom bug on break by slice
The break-by-slice handler called get_signal_history_per_sample(), which inflated the entire per-sample signal history into nested dicts and then triple-looped over it (~609 MB spike per slice query -> OOM). Separately, query_per_sample() compared sample ids as int (stored) vs str (queried), so the cheap path would have silently returned 0 rows.
Query the compact per-signal arrays directly via query_per_sample, normalize the id compare to str, and derive audit_mode from the eval-marker hash. A 200-sample slice over 2.1M entries now peaks at +0.6 MB instead of +609 MB (~1000x lighter).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Fix BBS feature to make the main computation part on our side
---------
Co-authored-by: Alexandru Rotaru <rotarualexandruandrei94@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: GuillaumePELLUET <guillaume@graybx.com>
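The int-versus-str mismatch is easy to reproduce in isolation. The sketch below is a hypothetical, simplified `query_per_sample` over parallel lists (an assumed stand-in for the compact per-signal arrays); it normalizes both sides of the comparison to `str` and scans the data once without building nested dicts for samples that were not requested.

```python
def query_per_sample(sample_ids, steps, values, wanted_ids):
    """Return {sample_id: [(step, value), ...]} for the requested ids only.

    sample_ids, steps and values are parallel sequences of equal length.
    """
    wanted = {str(s) for s in wanted_ids}  # normalize the query side
    out = {}
    for sid, step, value in zip(sample_ids, steps, values):
        key = str(sid)  # normalize the stored side
        if key in wanted:
            out.setdefault(key, []).append((step, value))
    return out


stored_ids = [1, 2, 1, 3]          # stored as int
queried_ids = ["1", "3"]           # queried as str
naive = [s for s in stored_ids if s in set(queried_ids)]  # int vs str: no match
result = query_per_sample(stored_ids, [0, 1, 2, 3], [0.5, 0.7, 0.3, 0.9], queried_ids)
```

The naive membership test returns nothing because `1 != "1"` in Python, which matches the "silently returned 0 rows" behaviour described above.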

…d csv (#183) * Implement comprehensive audit logging for all gRPC user interactions with before/after tracking - Create AuditLogger class in backend/audit_logger.py with thread-safe JSON/CSV writing - Initialize audit loggers in ExperimentService and DataService with root_log_dir - Log all 8+ gRPC handlers with detailed before/after values: * ExperimentCommand: hp_change, mode_switch, pause, resume * GetLatestLoggerData: metrics_fetch * RestoreCheckpoint: checkpoint_restore * TriggerEvaluation: evaluation_start * EditDataSample: tag_add, tag_remove, sample_discard, sample_restore * ApplyDataQuery: query_execute * GetDataSamples: data_fetch - Append-only audit_log.json and audit_log.csv files in root_log_dir - ISO 8601 timestamps with microseconds in JSON; CSV with escaped JSON details - Thread-safe file operations Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> * Add comprehensive unit tests and documentation for audit logging Tests: - 26 unit tests covering all AuditLogger functionality - Tests for JSON and CSV output formats - Thread-safe concurrent logging with 10+ threads - Error handling and edge cases - Real-world scenario tests (hyperparameter changes, data edits, training control) - Special characters and Unicode handling - All tests passing Documentation: - Comprehensive audit_logging.rst guide with examples - Overview of what's logged (7+ action types) - JSON and CSV format specifications - Configuration and file locations - Real-world scenarios and troubleshooting - API reference and best practices - Added to docs index for discoverability Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> * Reorganize documentation: separate gRPC functions and audit logging Documentation restructuring: - Created new gRPC section with subsections - docs/grpc/index.rst: Overview and architecture of gRPC communication - docs/grpc/grpc_functions.rst: Complete reference of all RPC handlers (13 methods) * ExperimentCommand (HP changes, pause/resume, mode 
switching) * GetLatestLoggerData (metrics and signals) * RestoreCheckpoint, TriggerEvaluation, GetEvaluationStatus, CancelEvaluation * GetDataSamples, ApplyDataQuery, EditDataSample, GetDataSplits * GetWeights, GetActivations, GetSamples * Includes: request/response types, parameters, behavior, audit logging status * Covers: error handling, performance considerations, debugging, common patterns - docs/grpc/audit_logger.rst: Comprehensive audit logging documentation * Moved from docs/audit_logging.rst with updated cross-references * Explains what gets logged, file formats (JSON/CSV), configuration * Real-world scenarios, troubleshooting, API reference, best practices - Updated docs/index.rst to reference new gRPC section Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> * Add configurable audit log output format via AUDIT_LOG_FORMAT environment variable - Modified AuditLogger to write only one format (json OR csv), not both - Added format parameter to AuditLogger.__init__() with environment variable support - AUDIT_LOG_FORMAT=json (default) or AUDIT_LOG_FORMAT=csv - Explicit format parameter takes precedence over environment variable - Updated all 33 tests to work with format selection: - Fixed TestAuditLoggerCSV, TestAuditLoggerErrorHandling, TestAuditLoggerThreadSafety - Added TestAuditLoggerFormat class with 4 new tests for format configuration - Updated docs/grpc/audit_logger.rst Configuration section with AUDIT_LOG_FORMAT details and precedence rules - All tests passing: 33/33 Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> * Add ability to disable audit logging with AUDIT_LOG_FORMAT=none - Added "none" format option to AuditLogger to disable audit logging entirely - When format="none", log_event() returns early without creating files - Added AUDIT_LOG_FORMAT=none to docs/configuration.rst environment variables - Updated docs/grpc/audit_logger.rst Configuration section with disable feature - Added 3 new tests for disable functionality (36 total 
tests, all passing): - test_none_format_disables_logging() - test_none_format_from_environment_variable() - test_explicit_format_none_overrides_json_default() - Precedence unchanged: explicit format > environment variable > default Use cases for disabling: - Reduce disk I/O overhead in high-performance scenarios - Disable audit history for development/debugging sessions - Focus on other logging without audit pollution Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> * Fix FutureWarning: Set incompatible dtype column to object before assignment When upserting data with mixed dtypes (e.g., initializing column with bool False, then assigning string/array values), pandas raises a FutureWarning about incompatible dtypes. Fix by casting both the existing column and incoming values to object dtype before assignment to prevent dtype conflicts during merge operations. This resolves: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Tests: test_h5_dataframe_store.py passes, FutureWarning no longer raised Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> * Fix AttributeError: Use correct EDIT_ACCUMULATE instead of non-existent EDIT_ADD enum The SampleEditType enum only defines: - EDIT_OVERRIDE: Replace all tags - EDIT_ACCUMULATE: Add/accumulate tags - EDIT_REMOVE: Remove tags The audit logging code was trying to use the non-existent EDIT_ADD enum value. Fixed by using EDIT_ACCUMULATE for tag_add operations, which is the correct enum value for adding/accumulating tags based on _calculate_tag_column_updates docstring. 
Error was: AttributeError: Enum SampleEditType has no value defined for name 'EDIT_ADD' Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> * fix subscribe function to allow user to compute history-based samples * remove caching * fix warning issue with h5 * add sanity check on modeling feature and fix datasampler issues * clean custom signals decorator feature in readme and doc * Refactor query_per_sample to return dict of samples instead of list of tuples Changed return format from: List of (sample_id, step, value) tuples To: Dict mapping sample_id → list of dicts with 'model_age' and 'signal_value' keys Example: {'0': [{'model_age': x, 'signal_value': y}, ...], '1': [...]} Benefits: - More structured and readable format - Keys are labeled, not positional - Easier to work with in custom signals (e.g., computing loss variance) - Matches the format expected in SignalContext.subscribed_history Both get_current_signaL_history_per_sample and query_per_sample now return the new format. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> * Fix registered subscribed signals * Remove data fetch operations from audit logging Audit logging should only track user actions (write operations), not read-only operations like: - GetLatestLoggerData (metrics_fetch) - GetDataSamples (data_fetch) These are passive retrieval operations, not modifications to experiment state. 
Changes: - Removed data_fetch and metrics_fetch from audit logging documentation - Updated audit_logger.rst to list only user action types - Changed GetDataSamples and GetLatestLoggerData to 'Audit Logged: No (read-only operation)' - Updated reproducing experiment scenario to focus on user actions only Audit logging now logs only: - Model Control: hp_change, pause, resume, mode_switch - Data Operations: tag_add, tag_remove, sample_discard, sample_restore, query_execute - Checkpoint & Evaluation: checkpoint_restore, evaluation_start Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> * Fix audit logger implementation: reverse chronological ordering and test fixes - Fix _flush_to_json to reverse event order within batches for strict reverse chronological ordering (newest first) - Add buffer_size=1 to all test instances to ensure events flush immediately during testing - Update test expectations for reverse chronological order (newest events appear first in JSON) - Fix timestamp assertions in training control scenario to expect decreasing order (ts1 > ts2 > ts3) - All 41 audit logger tests now pass Features verified: 1. Persistence: audit logs append when restarting experiments from existing root_log_dir 2. Reverse chronological: newest events appear first in JSON output 3. Buffering: events batch in memory before writing to disk, with configurable buffer_size Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> * Add audit logging for plot note operations - Log note_write action when users save or clear notes on plot points - Capture metric_name, model_age, note_text, and note_action (saved/cleared) - Update audit logger documentation to include note_write in actions list This allows compliance tracking of all user annotations and notes. 
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> * Remove buffering approach from audit logger - use immediate writes instead - Remove buffer_size parameter and _event_buffer from AuditLogger - Change to immediate writes on each log_event() call - Rename _flush_to_json/_flush_to_csv to _write_json/_write_csv for single event writes - Remove flush() and _flush_buffer() methods - Remove all buffering-related tests - Update documentation to reflect immediate write approach Benefits: - No data loss on process crash or sudden termination - All audit events are persisted immediately to disk - Simpler implementation with same persistence guarantees - Still maintains reverse chronological ordering (newest first) Tests: 38 passed (3 buffering tests removed) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> * Expand signal decorator and SignalContext documentation - Add comprehensive parameter reference table for @wl.signal decorator * name, subscribe_to, compute_every_n_steps, include_history, include_history_metadata * Include performance considerations and use cases - Add advanced example from weightslab_kitchen: loss coefficient of variation * Shows how to access subscribed_history for multi-step analysis * Demonstrates history entry structure (signal_value, model_age) * Real-world use case: detect training instability - Expand SignalContext documentation with detailed attribute reference * Separate sections for dynamic signals vs. static signals * Document subscribed_history structure and access patterns * Add convenience properties (image, points, is_static, is_dynamic) * Include usage patterns and code examples This makes it clearer how to write effective custom signals with full history access. 
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> * Add sanity check with iterator * Fix utests bug with AuditLogs * update hard coded signals desc * set needs btw utests and pip publish packages and test * fix code quality issues * remove data fetching from audit and slow useless part of checkpoint manager loading (rng and data iterator state) as we do not manage data state reproducibility for now --------- Co-authored-by: Claude Haiku 4.5 <noreply@anthropic.com>
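The format precedence stated in these commits (explicit format > AUDIT_LOG_FORMAT environment variable > json default, with "none" disabling logging) can be sketched as below. `resolve_audit_format` and `audit_enabled` are hypothetical helpers, not the AuditLogger API, and raising `ValueError` on an unknown value is an assumption; the source does not say how invalid values are handled.

```python
import os

VALID_FORMATS = ("json", "csv", "none")


def resolve_audit_format(explicit=None, environ=None):
    """Pick the audit log format: explicit argument, then env var, then json."""
    env = os.environ if environ is None else environ
    if explicit is not None:
        chosen = explicit
    else:
        chosen = env.get("AUDIT_LOG_FORMAT", "json")
    if chosen not in VALID_FORMATS:
        # Assumption: reject unknown values instead of silently falling back.
        raise ValueError(f"unsupported audit log format: {chosen!r}")
    return chosen


def audit_enabled(fmt):
    """With "none", a logger would return early and create no files."""
    return fmt != "none"
```

Passing `environ` explicitly keeps the helper testable without mutating the process environment.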

* Upgrade database to handle multi-indexing samples_id // instance_id, and subsequent functions
* Fix certificate generation prompts in Windows tests
Remove certificate generation and validation prompts that appear when running test_ui_docker_bridge tests on Windows. The test_complete_onboarding_workflow was calling actual certificate generation code without mocking it, which would trigger Windows certificate store installation prompts.
Changes:
- Added proper mocking of _generate_certs_with_fallback() function
- Added mocking of _run_shell_script() to prevent bootstrap script execution
- Properly configured CertAuthManager mocks with check_and_apply() return value
- Added from_env_or_default() mock configuration for nested calls
All 40 tests in test_ui_docker_bridge.py now pass without prompting for certificate validation on Windows.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
* Update ws-detection to use per-instance metrics and losses
Switch detection example from per-sample to per-instance metrics:
- Replace PerSampleDetectionLoss with PerInstanceDetectionLoss
- Replace PerSampleIoU with PerInstanceIoU
- Log hierarchical loss levels: instance, sample, and batch
- Enable per-instance loss tracking for multi-instance dataframe support
Changes:
- Import PerInstanceDetectionLoss and PerInstanceIoU from criterions
- Configure losses with return_levels=True to get instance/sample/batch breakdown
- Manually log per-instance and per-sample metrics via wl.log_sample_signals()
- Use 'batch' level loss for backward pass to ensure proper gradient flow
- Maintain per-instance IoU computation for bounding box evaluation
This enables comprehensive per-annotation analysis in the UI while maintaining per-sample aggregation for backward pass compatibility.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> * Add both per-sample AND per-instance metrics to ws-detection Track metrics at both granularity levels: Per-Sample Metrics (aggregated): - PerSampleDetectionLoss: Bounding box, classification, DFL losses averaged per sample - PerSampleIoU: IoU averaged per sample - Auto-logged via per_sample=True - Signal names: train/bbxs_sample, train/clsf_sample, train/dfl_sample, iou/train_sample Per-Instance Metrics (per annotation): - PerInstanceDetectionLoss: Individual bbox losses for each annotation - PerInstanceIoU: IoU for each bounding box - Manually logged via wl.log_sample_signals() - Signal names: train/bbxs_instance, train/clsf_instance, train/dfl_instance, iou_instance This enables: - Aggregate per-sample analysis for model evaluation - Fine-grained per-annotation debugging - Identification of problematic detections at the instance level - Proper gradient flow through batch-level loss for training Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> * Auto-save per-instance signals via per_instance flag on watch_or_edit Add framework-side support for per-instance signal logging, mirroring the existing per_sample flow. Users can now wrap a per-instance loss/metric with `per_instance=True` and WeightsLab will: 1. Extract instance values from dict outputs (`{'instance','sample','batch'}` from PerInstanceDetectionLoss) or flat tensors (PerInstanceIoU). 2. Look up `batch_idx` from the second positional argument (the standard detection `batch` dict) or from kwargs, mapping each instance to its sample position. 3. Assign annotation_ids 0,1,2,... within each sample. 4. Save per-instance values to the dataframe at `(sample_id, annotation_id)` via the new `DATAFRAME_M.enqueue_instance_batch`. 5. Still log the per-sample aggregated mean for the dashboard and return the original dict to the caller so `out['batch']` works for backward. 
Changes: - `dataframe_manager.enqueue_instance_batch`: writes per-annotation rows using `update_values` (handles multi-index natively). - `src.save_instance_signals`: helper that maps instance values to (sample_id, annotation_id) via batch_idx and routes to the dataframe. - `wrappered_fwd`: detects `per_instance=True`, unwraps dict outputs, invokes `save_instance_signals`, and returns the original dict. - `ws-detection/main.py`: replaces manual `wl.log_sample_signals` calls with `per_instance=True` on the watch_or_edit registrations. - New unit test `test_enqueue_instance_batch_writes_per_annotation` validates the end-to-end write path. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> * Simplify PerInstanceDetectionLoss to return flat instance tensor Drop the dict-with-levels return type from PerInstanceDetectionLoss — only the per-instance values are needed, since PerSampleDetectionLoss already provides the per-sample gradient path for backward. PerInstanceDetectionLoss now returns a flat `(num_instances,)` tensor, ordered as in `batch['batch_idx']`. With `per_instance=True` on watch_or_edit, the framework auto-saves these values at `(sample_id, annotation_id)` in the dataframe. Changes: - `criterions.py`: remove `return_levels` param; forward returns a flat instance tensor and only that. - `main.py`: backward now uses `per_sample.mean()` from PerSampleDetectionLoss; per-instance criterions are called only for their side-effect of auto-saving annotation-level signals. - `src.py`: skip the per-sample save_signals path when `per_instance=True` (instance-length tensors don't map 1:1 to batch_ids). - New test `test_save_instance_signals_maps_batch_idx_to_annotation_ids` verifies the (sample_id, annotation_id) mapping from batch_idx. 
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> * Fix duplicate-label error when per-sample buffer flushes into multi-index df enqueue_batch produces single-level (sample_id) records, but when the global dataframe has a MultiIndex, concatenating them directly creates a hybrid index with both tuples and ints. The next flush then crashes with: ValueError: cannot reindex on an axis with duplicate labels at _apply_buffer_records_nonblocking. Root cause: _apply_buffer_records and _apply_buffer_records_nonblocking didn't bridge between the single-level buffer and the multi-index global dataframe. Fix: add _broadcast_to_multi_index which expands each single-level (sample_id) buffer record into one row per existing (sample_id, annotation_id) pair. Both apply paths now invoke it before merging, so the global dataframe stays a proper MultiIndex and per-sample signals are broadcast to every annotation of the sample. Adds regression test test_per_sample_buffer_into_multi_index_does_not_corrupt that asserts index integrity through two consecutive flushes and that per-sample values are correctly broadcast to all annotations. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> * Fix _normalize_arrays_for_storage on multi-index rows When the dataframe is multi-indexed, `row.name` is a `(sample_id, annotation_id)` tuple. The code passed that tuple directly to `dataset.get_index_from_sample_id` (which expects a plain sample_id string), causing every array-column normalization to fail with a KeyError. The error was caught and only logged at DEBUG level, but it flooded the log and disabled the target/prediction normalization on every flush. Extract `sample_id = row.name[0]` when `row.name` is a tuple; otherwise fall back to the original row.name. Adds regression test test_normalize_arrays_for_storage_handles_multi_index_row that injects a fake dataset and asserts the sample_id (not the tuple) is passed to `get_index_from_sample_id`. 
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> * cleaning branch * fix usecases examples configs and python file * fix grpc and agent interface with multi-indexing * fix multi-indexing issues * fix ext files * Add cat. tag management and fix tests * Fix documentation * Fix h5 compat. with multi-indexing * Fix utests and add new ones * Fix multi-index issues with trainer gRPC functions; h5 array issues for sync batch idx; and detection tasktype bb; and finally update the documentation * Fix broken utests --------- Co-authored-by: Claude Haiku 4.5 <noreply@anthropic.com>
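The (sample_id, annotation_id) mapping described in the per-instance commits can be sketched in a few lines. `map_instances_to_annotations` is a hypothetical helper, not `src.save_instance_signals`; it assumes `batch_idx` gives, for each instance, the position of its sample within the batch, and that annotation ids count up from 0 within each sample in order of appearance.

```python
def map_instances_to_annotations(batch_idx, sample_ids, values):
    """Map flat per-instance values to {(sample_id, annotation_id): value}."""
    if len(batch_idx) != len(values):
        raise ValueError("batch_idx and values must have the same length")
    next_annotation = {}  # batch position -> next free annotation id
    rows = {}
    for pos, value in zip(batch_idx, values):
        pos = int(pos)
        annotation_id = next_annotation.get(pos, 0)
        next_annotation[pos] = annotation_id + 1
        rows[(sample_ids[pos], annotation_id)] = value
    return rows


rows = map_instances_to_annotations(
    batch_idx=[0, 0, 1, 2, 2, 2],          # two boxes, one box, three boxes
    sample_ids=["a", "b", "c"],
    values=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
)
```

The tuple keys line up with a two-level index, which is why an instance-length tensor cannot be written through the per-sample path that expects one value per batch position.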

* DDP support (WIP): SPMD primitives + 4-plane model + workers-correctness fixes Wraps the existing single-process YOLO ws-detection example for DDP via mp.spawn — no train.py edits needed past the spawn shim. Surfaces: - weightslab/components/ddp_basic_building_blocks.py — SPMD primitives: register_consistent_state, reconcile_all (bundled DOWN broadcast), register_outbox + flush_outbox (bundled UP gather), sync_step (per-step anchor + collective pause-spin). One broadcast + one gather per step. - weightslab/components/ddp_planes.py — the 4-plane model (CONFIG / CONTROL / DATAFRAME / LOGGER) + 5 dtype-keyed reducers (MAX / LATEST / UNION / RANK_0_ONLY / IGNORE) + DOWN_ONLY whitelist for cross-rank DOWN- flowing per-sample columns. - components/global_monitoring.GuardContext — guard_training_context now auto-registers the core states + invokes sync_step on first DDP entry. - data/dataframe_manager.py — shm mirror of DOWN_ONLY columns visible to DataLoader subprocess workers via fork; per-cell value-change gate so rank-N's idempotent reconcile applies don't thrash worker resets; iter invalidation triggers on real DOWN_ONLY mutations only. - backend/dataloader_interface.py — WeightsLabDataSampler composes DistributedSampler under DDP; sampler reads the shm mirror at yield time (fork-safe); DataLoaderInterface gains _invalidate_iter to drop prefetched stale batches on the trainer thread (avoids the std::terminate crash from worker shutdown on a non-owning thread). - trainer/services/experiment_service.py — RestoreCheckpoint passes force=True (data snapshot was silently skipped when hashes appeared equal) and re-pauses post-load (saved hp had is_training=True, which load_state's register_hyperparams would otherwise re-apply). - components/checkpoint_manager.py — three reset_iterator sites route through the lazy _invalidate_iter path under DDP+workers. - examples/PyTorch/ws-detection/src/main_ddp.py — spawn shim worker. 
- examples/PyTorch/ws-detection/src/yolo_pipeline.py — extracted YOLO pipeline (replaces the older ddp_smoke._build_pipeline / decode helpers). - examples/PyTorch/ws-detection/src/ddp_test_suite.py — 8-scenario gRPC integration suite: epoch_then_pause, discard_subset_freezes, break_by_slice, lr_batch_propagate, checkpoint_data_roundtrip, signal_coverage_all_graphs, resume_continues_curve, process_topology. - tests/test_ddp_primitives.py — trivial 3-rank gloo verification of reconcile_all (convergence + idempotency + change propagation). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Add 4 gap-coverage scenarios + collective-budget instrumentation 12 scenarios green end-to-end (8 original + 4 new). The new ones cover gaps that were missing direct verification: - scenario_multi_epoch_stability — 3 epochs back-to-back, asserts (sid, model_age) entries are unique per graph (idempotent dedup at the outbox merge) + age is strictly monotonic across epochs. Catches the regression where outbox flushes would append rather than upsert per-step. - scenario_empty_shard_starvation — discards ~95% of populated samples; asserts the trainer does NOT silently hang at the next grad all-reduce when one rank's shard ends up empty. Verifies loader cycle-and-skip semantics under heavy filtering. - scenario_seed_determinism — two consecutive break_by_slice pulls of the per-sample loss history return byte-identical (sid, age, val) triples. Catches stochasticity leaks in the read path that would silently break the loss-shape descriptor downstream. - scenario_collective_budget — programmatically asserts that every training step uses EXACTLY 2 collectives (1 reconcile_all broadcast + 1 flush_outbox gather). Hard perf gate against future regressions that add a stray dist.broadcast / dist.all_reduce in a hot path. 
Requires a small SDK hook: WL_DDP_COLLECTIVE_LOG=<path> appends the prior step's count to a file from inside reset_collectives() — opt-in, no overhead when unset. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Add scenario_curate_lifecycle — end-to-end UI curation flow Tests the realistic multi-edit workflow under DDP: epoch 1 → tag 3 suspects → discard them → epoch 2 → un-discard + tag 'verified' → epoch 3 → assert loss trajectory shows the gap. Assertions: [1] LIFECYCLE — for each suspect: pre-discard entries exist, NO entry in the (discard_age, undiscard_age] window for any of them (this is the proof that discard reached the workers' shm + sampler fast-path), AND ≥1 suspect resumes post-undiscard. [2] TAG COMPOSE — break_by_slice('verified') returns all 3 suspects (proves multi-tag stacking on the same sample). [3] PLOT METRICS — scalar plot has ≥3 epochs worth of points. Side change: Client.discard now accepts discarded=False to un-discard via the same EditDataSample RPC. This brings the suite to 13 scenarios, all green at WL_DDP_BATCH=4, WL_DDP_WORKERS=0, WL_DDP_WORLD_SIZE=2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Cleanup: drop unused primitives + document drop_last=False rationale Removes ~155 LOC of dead-code surface that accumulated during the WIP push: - weightslab/components/ddp_basic_building_blocks.py: drop aggregate_up decorator, replicate_down decorator, reconcile_down (single-state hook), plus the combine-helpers (_concat, combine_rank0, make_concat_combine) that only existed to serve aggregate_up. The outbox/flush pattern superseded ① aggregate_up for per-sample hot writes (one gather/step instead of one per call); reconcile_all replaced reconcile_down (bundled broadcast); replicate_down was never invoked. Zero external references to any of them. Net -93 LOC. - weightslab/utils/tools.py + utils/__init__.py: drop DistributedCounter (CUT-tagged in the design notes, never adopted). Net -62 LOC. 
- weightslab/backend/dataloader_interface.py: keep drop_last=False on the DistributedSampler, but document why. Padded yields are real training events that land in the loss trajectory as real (sid, model_age, value) encounters with distinct ages from the sample's earlier yield; the trajectory is encounter-keyed, not per-epoch-unique-keyed, so padding is honest rather than pollution. drop_last=True was considered but rejected as too trivial — it'd silently drop the trailing (world-1) samples each epoch and bias coverage downward. Verified: scenario_discard_subset_freezes PASS at WL_DDP_BATCH=4, WL_DDP_WORKERS=0, WL_DDP_WORLD_SIZE=2, WL_DDP_TEST_STEPS=20 — 156 populated samples, 5 discards held frozen across epoch 2, ~80% advance on non-discarded. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Rename ddp_planes → parallel_state, ddp_basic_building_blocks → parallel_primitives Generalizes naming away from "DDP" since the primitives don't assume the specific torch.distributed-DDP topology — they hold for FSDP / ZeRO / any SPMD setup with a rendezvous-on-collective contract. - weightslab/components/parallel_primitives.py (was ddp_basic_building_blocks) - weightslab/components/parallel_state.py (was ddp_planes) - docs/ddp_design.md (was components/ddp_design_notes.md) Updated import sites (4): global_monitoring.py, dataframe_manager.py, parallel_primitives.py (self-ref), tests/test_ddp_primitives.py. Also: scenario_lr_batch_propagate threshold loosened from `(expected + rank0_only) / 2` to `rank0_only + 1`. The old midpoint sat right at the noise band; under drop_last=False with mid-iter batch-size transitions the observed rate floats 13–14 samples/step, occasionally tripping at exactly 13.75 < 14. New threshold cleanly distinguishes "both ranks doubled" from "only rank-0 doubled" (rate ~12). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fix load_state to preserve model identity on same-arch restore ledgers.register_model(new_model) replaced the registered object, orphaning any captured reference (e.g. `model = trainer.model` in a training loop, or DataLoaderInterface.self.model). Post-restore the trainer trained a stale model while pause-checks read the fresh one, so pause_at_step never fired — caught by scenario_resume_continues_curve in the DDP suite. Skip register-replace + guard updates when existing model has same keys AND shapes as the saved weights; let apply-weights load in-place. Add a regression assertion in test_06 that captures the wrapped model identity across load_state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Rewrite docs/ddp_design.md as a concise design overview Cut 159 → 66 lines (~58% smaller, 1426 → 588 words). Drops decision-tag scratchpad ([DECIDED]/[OPEN]/[DEFER]/[CUT]), per-state placement debate, wrapper-prologue code sketches, and the open-questions section — all process residue. Keeps: - Two-space framing (train-space vs sdk-space, kernel/user analogy). - SPMD-with-one-privileged-rank constraint. - Two-kinds-of-sync framing (grad reduction = off-the-shelf; async UI = WL's job). - The loop-iteration-as-transaction insight and the train→sdk transition as the consistency boundary. - State × direction table. - DOWN broadcast / UP outbox / shm mirror mechanisms with API entry points (register_consistent_state / register_outbox + the anchor functions). - Collective budget (~2 rendezvous/step) + instrumentation env vars. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Outbox ships per-step deltas, not full snapshots local_df_writes / local_signal_triples emitted the WHOLE dataframe and the WHOLE per-sample signal history every step, gathered to rank-0 each step. 
The ~2-collectives/step budget bounds the COUNT of rendezvous but not their bytes, so payload scaled with N_samples x world (df) and grew unboundedly (signals) — the real scaling wall. Each rank now dumps only what changed since its last flush: changed dataframe rows (vs a process-local _LAST_SENT_DF signature cache) and signal triples past a per-(graph, exp_hash) cursor read straight off the append-only buffers. On respawn/restore the cache resets to a one-time full resend, safe because every merge is idempotent. merge_df_writes seeds rank-0's current values first (existing-first) before the per-column reducer, so a delta that omits a sample rank-0 already holds a higher value for cannot regress MAX/UNION, while LATEST still resolves to the newest delta. clear_registry resets the caches. Docs + outbox comment updated to describe the delta transport and clarify the budget governs collective count, not payload; shm section corrected to note only the bool deny-list (`discarded`) is read at __getitem__, not user_tags. Validated: scenario_signal_coverage_all_graphs PASS (per-sample 940/940 across both ranks; test_ddp_primitives 3-rank reconcile PASS). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Align DDP module docstrings with ddp_design (remove stale primitives) The parallel_primitives module docstring advertised three decorator-form primitives — AGGREGATE_UP, REPLICATE_DOWN — and a reconcile_down single-state hook, none of which exist in the code (grep finds them only in the docstring). The implemented + design-doc'd surface is two mechanisms: register_consistent_ state/reconcile_all (DOWN) and register_outbox/flush_outbox (UP), plus the shm deny-list mirror and the sync_step anchor. Rewrote the docstring to match that and dropped the ①/②/③ numbering everywhere (parallel_state plane table and the inline anchor comments now use plain DOWN/UP, matching docs/ddp_design.md). Comment/docstring-only; no logic change. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Add fast unit tests for the outbox delta optimization Proves the delta change's correctness in ~30ms instead of via the multi-minute YOLO scenarios: df_writes emits only changed rows; merge_df_writes seeds rank-0's current value so a stale/lower delta can't regress a MAX column (last_seen) while a higher one still raises it and LATEST picks the newest delta; local_signal_ triples advances a per-(graph, exp_hash) cursor and resends from 0 if the buffer shrank under it (restore safety). No DDP spawn — ledger getters are monkeypatched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Document DDP sampler sharding is training-only (eval is latent-unsupported) The sampler shards EVERY loader, but the per-step anchor runs only in training context — so a sharded eval under DDP would have each rank score 1/world with no scalar-metric aggregation (undercount). No eval runs under DDP today, so this is latent; added a TODO(ddp-eval) guard at _get_dist_sampler so nobody adds a DDP eval loop without first resolving the eval sharding/aggregation policy. Comment-only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Cap shm deny-list index so large/sparse uids don't blow allocation The DOWN-only shm vec is indexed directly by int(sample_id), so a single large sparse uid (e.g. an inode-based id ~1e8) allocated a ~100MB bool array. Cap the index at 1<<22 (~4M → ~4MB max per origin/col): ids at/above the cap skip the shm fast-path and fall back to the sampler's pandas deny-list check, which is the actual read site (the shm is read in the main-process sampler, not workers — docstring corrected to say so). Dense 0..N id schemes keep the fast-path. Warns once per origin when the cap is hit. Adds test_ddp_shm_cap: huge id neither allocates a giant array nor breaks the small-id fast-path; undiscard clears the cell. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Restrict shm mirror to boolean deny-list columns (skip user_tags) _propagate_to_shm ran bool(val) over every DOWN_ONLY column into the bool shm vec, so user_tags (a list column) stored a meaningless bit that nothing reads (the sampler only queries 'discarded'). Filter the mirror to genuinely boolean columns via _shm_bool_eligible: bool dtype, or an object column whose non-null cells are all bool / 0-1 scalars. user_tags still reconciles to children via the DOWN broadcast; it just no longer gets a bogus shm array. Test: a user_tags list column allocates no shm array while discarded still does. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Decouple DDP shard reshuffle from iterator reset (epoch → reshuffle_seq) _generate_indices auto-advanced the DDP epoch on every fresh iterator, so a mid-loop discard/tag — which forces an iterator reset — reshuffled the entire per-rank shard. In a curation workflow (discard-heavy by design) the shard order churned on every discard, and "epoch" counted iterator resets, not dataset passes — a meaningless abstraction here. Rename _epoch → _reshuffle_seq (it's a reshuffle generation, not a pass count) and stop auto-advancing it. The reshuffle generation now advances ONLY on a genuine pass-end reset (loader calls sampler.advance_reshuffle() on the _epoch_exhausted path); the discard-invalidation reset path leaves it untouched, so a discard re-filters the SAME permutation instead of reshuffling. Reproducibility across resets is preserved but composed correctly: the per-rank permutation is a pure fn of (ddp_seed, reshuffle_seq, rank, world), so capture/restore_iteration_state now save/restore reshuffle_seq + seed; combined with samples_yielded (offset) and the deny-list (checkpointed as a DOWN_ONLY df column) this reproduces the exact filtered stream. Warns on a seed mismatch. 
Side effect: also fixes the __len__-vs-iteration epoch off-by-one — neither _generate_indices nor _rank_indices_snapshot advances during iteration now, so they read the same generation. Adds test_ddp_reshuffle_seq: re-gen without advance is stable; advance reshuffles; restore reproduces; ranks partition disjoint+cover; seed mismatch warns. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Update README.md * Guard user callbacks at collective boundaries so one can't hang the group reconcile_all built its snapshot ({name: snap()}) and applied children states unguarded, and flush_outbox built its payload ({name: dump()}) unguarded. Under SPMD a callback that raises on one rank but not others crashes that rank BEFORE (snapshot/dump) or AFTER (apply) a collective, leaving every other rank blocked forever on the broadcast/gather. Wrap each snap()/apply()/dump() in try/except (merge() already was), so the collective itself is ALWAYS reached: a failed state/channel ships as None (apply/merge already tolerate None) and the rest sync normally. Matches the DDP module's existing swallowed-exception style (logger.debug("[tag] ... failed: %s", exc)). Also switch parallel_primitives' logger from getLogger("weightslab.ddp") to getLogger(__name__) to match the rest of the codebase. Adds test_ddp_collective_resilience: a "bad" state (snapshot raises on rank 0, apply raises on children) + a "bad" outbox (dump/merge raise) run two anchor rounds + a barrier without hanging, and the healthy state still syncs on all ranks. Original test_ddp_primitives still passes (no reconcile regression). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Add world>2 uneven-shard coverage to the reshuffle test n=25, world=3 exercises DistributedSampler's drop_last=False padding: each shard pads to length 9 and the union still covers the whole universe. Complements the even world=2 disjoint+cover case. 
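The n=25, world=3 padding arithmetic above can be reproduced without torch. This is a simplified sketch of the drop_last=False index logic (pad by wrapping around, then stripe by rank); it omits shuffling, and `padded_shards` is a hypothetical helper, not a function from the repository.

```python
import math


def padded_shards(n: int, world: int) -> list[list[int]]:
    """Split range(n) into `world` equal-length shards, padding by wrap-around."""
    per_rank = math.ceil(n / world)      # shard length after padding
    total = per_rank * world             # padded universe size
    indices = list(range(n))
    indices += indices[: total - n]      # repeat the first few ids as padding
    # Rank r takes every `world`-th index starting at r.
    return [indices[rank:total:world] for rank in range(world)]


shards = padded_shards(25, 3)
lengths = [len(s) for s in shards]
covered = set().union(*shards)
padded_count = sum(lengths) - 25
```

Each shard has the same length, the union still covers every sample id, and the extra yields are repeats of samples that already appear elsewhere in the stream.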
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Remove the redundant shm deny-list mirror The shared-memory DOWN_ONLY mirror was never load-bearing: the deny-list is enforced sampler-side (a discarded sample is never yielded), the sampler's pandas cache already refreshes on the deny-list revision bump (so a live discard is reflected within one index), and a sample already in a worker's prefetch queue is dropped by iterator invalidation tearing the workers down — NOT by shm (shm filtered at yield time, before the discard existed, so it never had power over queued samples). No test depended on it; nothing read it except the main-process sampler. Remove _propagate_to_shm / _ensure_shm_capacity / _shm_bool_eligible / is_in_down_only_shm + the shm fields and the sampler's shm fast-path (~190 LOC, plus the ctypes/multiprocessing/os imports). Replace the invalidation gate — which must stay gated on an ACTUAL value change so rank-1+ don't respawn workers every step under DDP — with a pandas before/after diff (_down_only_changed) computed before the upsert merges. Update docs (design + comments) to describe sampler-side enforcement + invalidation, dropping the inaccurate "workers fork-read the shm" story. Supersedes the shm-cap (task #3) and user_tags-shm (task #4) fixes. Drops test_ddp_shm_cap; adds test_ddp_down_only_change covering the gate, including the DDP no-respawn invariant (re-applying the same snapshot → no change). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * DDP: rebalance-on-discard sharding, drop per-flush image decode, anchor/delta cleanups - Sampler: replace shard-then-filter with a filter->pad->stripe REBALANCE so live shards are always equal length across ranks — fixes the empty-shard deadlock (scenario_empty_shard_starvation) by construction; order-preserving + deterministic (not a reshuffle). 
persistent_workers=True makes a discard/undiscard rebuild a cheap reset (reuse workers, drop stale prefetch) instead of a fork+reinit. drop_last=False under DDP keeps a tiny live set training so age still advances. - dataframe_manager: stop the storage-time bbox->seg get_mask conversion — it re-decoded a full image per signal flush (~13% of rank-0 wall) and silently corrupted detection prediction_raw; mask rendering stays available on-demand via get_prediction_mask. Remove the now-dead normalize/_is_array_column/_get_loader. - Anchor split DOWN(__enter__)/UP(__exit__); outbox ships per-rank deltas via a writer dirty-set; remove post-hoc active-sample masking from the model wrapper. - Test suite: compute epoch_steps from WL_DDP_BATCH (was config's mono batch=4 while the loader trained at 16 -> every "epoch" silently covered ~4 passes); add scenario_progressive_resample (shrink->grow) + per-phase timing + a WL_DDP_SELFSPY self-profiling hook. - docs/ddp_design.md: document rebalance-not-reshuffle + persistent-worker reset. Note: 2 coverage assertions (epoch_then_pause populated~=shard, progressive_resample advance%) still assume the old over-training and read the per-sample gather before it fully settles for a now-correct single epoch; recalibration deferred. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * DDP: make WL_DDP_BATCH effective, event-based pause (no sleep), resumable suite - yolo_pipeline: push WL_DDP_BATCH into the in-memory cfg, not just the DataLoader ctor. _sync_batch_size_from_ledger re-applies the ledger's batch every iteration, so without this the loader silently reverted to config's mono batch=4 and the env was dead — the suite trained at 4 while epoch_steps assumed 16 (¼-epoch coverage, which failed epoch_then_pause / progressive_resample after the epoch-math change). Now img=16 for real: full single-epoch coverage + genuine ~23% speedup (4× fewer steps, same work). config.yaml on disk untouched (mono unaffected). 
- parallel_primitives / global_monitoring: kill the 20ms busy-sleep pause-spin in sync_step. Rank 0 blocks on the pause_controller resume Event (new wait_for_resume); rank-1+ block inside the next reconcile_all broadcast. Neither spins (gloo socket- waits; NCCL would busy-spin — noted). Bounded timeout kept for SIGINT/SIGTERM responsiveness, not polling. - ddp_test_suite: WL_DDP_SKIP (comma-sep substrings) so a killed run resumes by skipping already-passed scenarios. Full 14-scenario suite green at batch-16 (incl. empty_shard_starvation, progressive_resample, and the event-based pause via epoch_then_pause). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * DDP: trim verbose comments (no behavior change) Condense the "novel"-length comments added during the DDP work to 1-2 tight sentences each — worst offender was dataloader_interface (rebalance/reshuffle, __iter__, __len__, persistent_workers, _reset_iterator). Key invariants kept; redundant restatements dropped. ~80 fewer comment lines. Code unchanged (reshuffle unit test still green). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * DDP: delta the DOWN reconcile + vectorize the UP merge (anchor 2x faster, no O(N)/step) Two hidden O(N)-per-step costs in the cross-rank anchor, both removed: - DOWN reconcile (rank0_df_down_state) rebuilt+pickled+broadcast {col:{sid:val}} for ALL non-null DOWN_ONLY cells every step (discarded defaults False=non-null, so all touched samples rode) — O(N). Now a DELTA: rank-0 ships only the sample-ids whose DOWN_ONLY changed since the last reconcile (drain_down_delta dirty-set, populated in upsert_df on a real DOWN diff), with one full snapshot on first reconcile / post- restore (mark_down_full_resend, hooked in _load_existing_data) so children converge. N-sweep: full build+pickle was 1.5ms@1k / 119ms@100k / 619ms@500k; delta ~0 when unchanged, O(discards) otherwise. 
- DOWN_ONLY trimmed to {"discarded"} — "user_tags" was never a real column (it's "tag"), tags are rank-0 UI state (tag->label override is vestigial), and tag queries gather signals UP + filter on rank-0, so nothing tag-shaped needs to reach a sampler. - UP merge (merge_df_writes): replaced per-column groupby.apply(python reducer) with one vectorized groupby.agg({col:'max'/'last'}) — _r_max/_r_latest map exactly to skipna max/last; policy_for only yields MAX/LATEST here (UNION is tags, DOWN-filtered). And _rank0_existing_seed stopped copying the WHOLE dataframe every flush — it now indexes just the delta's ~batch rows. Anchor 168 -> 153 (DOWN delta) -> 78.6 ms/step (merge), ~2x. Validated: discard_subset_freezes, progressive_resample, break_by_slice, curate_lifecycle, signal_coverage all PASS. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Add ddp_ablation.py — 3-mode WL SDK overhead harness (time/mem/IO/bytes per rank) WL_ABLATE=ul|wlimport|wl on 2 gloo ranks, WL_ABLATE_STEPS configurable. Per mode: per-section ms/step, grad bytes/step, per-rank RSS + /proc/self/io (rchar/wchar = syscall bytes incl. gloo sockets; read_bytes/write_bytes = actual disk), and the WL df RAM + H5 store sizes + H5 flush config. 256-step result: WL time tax +247ms/+17.6% (criterions+log +148ms = save_signals + NMS decode-for-logging is the biggest, anchor +89ms, loader/wrapper +28ms); RAM +108MB import-idle + ~40MB active; df/H5 tiny. I/O surfaced that rank-1 redundantly persists ~6MB to H5 (should be outbox-only — rank-0 is authoritative). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Relocate DDP integration suite + perf/ablation into tests/ PR #185 review: tests don't belong in the usecase dir. Move ddp_test_suite.py, ddp_ablation.py, aggregate_wl_ownership.py and the report driver out of examples/PyTorch/ws-detection/src into tests/integrations/ultralytics/ddp/. 
One god-script (run_ddp_report.sh) with modes info/scenarios/ablation/profile emits a single report: perf counters, per-scenario times, and the wl-ulmanual ablation delta. Scripts path-bootstrap back to the usecase src so yolo_pipeline / utils.* / config.yaml resolve. Added README + .gitignore. Locally-run (needs GPU + dataset) — not a CI unit test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* PR #185: all_reduce_scalar(avg) + mono/DDP usecase README

- utils.tools: add all_reduce_scalar(value, reduction="sum"|"avg"); avg = sum/world since gloo has no ReduceOp.AVG. Keep all_reduce_sum_scalar as a back-compat wrapper.
- examples/.../ws-detection/README.md: document the mono (main.py) and DDP (main_ddp.py) usecases, how to run, and the single-GPU gloo DDP simulation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* PR #185: exclude is_training/pause_at_step/root_log_dir from the saved HP snapshot

get_HP_snapshot dumped the whole hp dict, so a restore's register_hyperparams(saved_config) resurrected experiment STATE (notably is_training=True) — the bug the post-restore force in experiment_service worked around. Strip the same state-only keys already excluded from the experiment hash, on a copy (never mutate the ledger). The post-restore pause stays, now as the intentional "user drives the next cycle", not a workaround.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Fix ruff F841: drop unused group_ids / pre_restore_max_plot_age / ctx

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Alexandru Rotaru <rotarualexandruandrei94@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
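The HP-snapshot change described above (strip state-only keys, on a copy) might look roughly like the following. This is an illustrative sketch only: `hp_snapshot` and `STATE_ONLY_KEYS` are hypothetical names, and the real get_HP_snapshot may differ in signature and storage.

```python
import copy

# The three state-only keys named in the commit title.
STATE_ONLY_KEYS = frozenset({"is_training", "pause_at_step", "root_log_dir"})


def hp_snapshot(hp: dict) -> dict:
    """Return a deep copy of `hp` without experiment-state keys.

    The input dict (standing in for the ledger) is never mutated.
    """
    return {
        key: copy.deepcopy(value)
        for key, value in hp.items()
        if key not in STATE_ONLY_KEYS
    }


ledger = {
    "lr": 0.01,
    "optimizer": {"name": "sgd"},
    "is_training": True,
    "pause_at_step": 100,
    "root_log_dir": "/tmp/run",
}
snapshot = hp_snapshot(ledger)
```

Restoring from such a snapshot cannot resurrect `is_training=True`, because the key is absent from what was saved in the first place.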

…files, heading levels

- PRs now formatted as [#N](url) title — date with author GitHub profile links
- Contributors section links to github.com/{login} (from PR authors)
- Commits capped at 25 most recent non-merge commits
- Title: ## **Weightslab** (no version, ## level)
- Sections demoted to ### level
- Removed separator between title and LinkedIn/Graybx links
- Dev release routes PRs from --base dev, main from --base main
- Doc build gated on main-branch check (not just tag pattern)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


* Add to detection usecase dump history and custom signal labelling
* upgrade documentation with new functions and examples
* add and fix utests
* refactor the logger and add instances history and queries functions, with a user wl.write_history function
* add df writing for user during exp
* fix code quality issues

GetDataSamples with an empty stats_to_retrieve fell through to the slow per-sample path that serializes every column. For dense tasks (e.g. 3D detection) pred/target are large JSON arrays (~310 KB/record), bloating the response to 100s of MB and silently breaking the histogram fetch. Route empty-stats requests (no image payload, no resize) through the fast vectorized metadata path, defaulting to all columns except heavy blob columns (pred/target/prediction_raw). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
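The column-defaulting rule for empty-stats requests can be sketched as below. This is a hypothetical illustration, not the service code: `resolve_columns` and `HEAVY_BLOB_COLUMNS` are made-up names, and only the three blob column names come from the message.

```python
# Heavy blob columns named in the commit message.
HEAVY_BLOB_COLUMNS = frozenset({"pred", "target", "prediction_raw"})


def resolve_columns(stats_to_retrieve: list[str], available: list[str]) -> list[str]:
    """Pick the columns to serialize for a metadata-only request.

    An explicit request is honored as-is; an empty request defaults to
    every available column except the heavy blob columns.
    """
    if stats_to_retrieve:
        return list(stats_to_retrieve)
    return [col for col in available if col not in HEAVY_BLOB_COLUMNS]


available = ["sample_id", "loss", "discarded", "pred", "target", "prediction_raw"]
default_cols = resolve_columns([], available)
explicit_cols = resolve_columns(["pred"], available)
```

A caller that really needs a blob column can still ask for it by name; only the empty-request default skips it.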



pandas >= 2.1 warns (and will error in a future version) when a partial .loc assignment changes a column's dtype - e.g. assigning object data (numpy arrays / stringified masks) or bool flags into a column that was initialized as float64 via np.nan.

- h5_dataframe_store.upsert: add _align_col_dtype_for_assign() to widen the target column to object before partial merges (3 assignment sites).
- dataframe_manager._merge_overwrite: widen target to object before assigning object/bool values into a numeric column.
- dataframe_manager._apply_updates_frame_locked: keep the up-front categorical widen (uncatchable AssertionError otherwise) and make the masked block assignment promote the warning + widen-to-object and retry, covering array-into-float32 signal writes.

Compatible numeric assignments keep their dtype (fast path). No production source emits the warning under -W error; data + ledger suites pass (153 tests).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
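The widen-before-assign idea can be illustrated with a small pandas sketch. It is an assumption-laden stand-in for _align_col_dtype_for_assign(), not its actual code: `assign_with_widen` is a hypothetical helper, and the compatibility check is deliberately conservative (only plain numbers into a float column take the fast path).

```python
import warnings

import numpy as np
import pandas as pd


def _is_plain_number(value) -> bool:
    """True for int/float scalars, excluding bools."""
    return isinstance(value, (int, float, np.integer, np.floating)) and not isinstance(
        value, (bool, np.bool_)
    )


def assign_with_widen(df: pd.DataFrame, rows, col: str, values) -> None:
    """Partial .loc assignment that widens `col` to object when needed."""
    values = list(values)
    dtype = df[col].dtype
    fast_path = pd.api.types.is_float_dtype(dtype) and all(
        _is_plain_number(v) for v in values
    )
    if not fast_path and dtype != object:
        # Widen first so the partial assignment cannot change the dtype implicitly.
        df[col] = df[col].astype(object)
    df.loc[rows, col] = values


with warnings.catch_warnings():
    warnings.simplefilter("error")  # turn any pandas dtype warning into a failure
    df = pd.DataFrame({"signal": [np.nan, np.nan, np.nan]})
    assign_with_widen(df, [0, 2], "signal", [1.5, 2.5])   # stays float64
    dtype_after_numeric = df["signal"].dtype
    assign_with_widen(df, [1], "signal", ["0110"])        # widens to object
    dtype_after_object = df["signal"].dtype
```

Running the assignments under `simplefilter("error")` mirrors the `-W error` check mentioned in the message: the numeric write keeps float64, and the string write goes through only because the column was widened beforehand.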

guillaume-byte and others added 30 commits May 21, 2026 16:48

add new nworkers tests for perfs

5eda2fc

Drop now-unused threading/pause_controller imports from dataloader_interface

8c501f8

Both were only referenced by _wait_if_paused, removed in 41b76b8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ws-detection: drop bench.py/run_bench.sh from the branch

9f23295

These were local correctness-check tooling for the num_workers sweep, not part of the example. Kept on disk locally (untracked), removed from version control. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge pull request #180 from GrayboxTech/fix-ws-det

08b89eb

Fix ws-detection example: restore runnable training loop

Merge pull request #181 from GrayboxTech/framework-intergrations/parallel+distributed+plightning+networkfs

90f6fde

Framework intergrations/parallel+distributed+plightning+networkfs

Merge branch 'dev' of https://github.com/GrayboxTech/weightslab into dev

3f44834

Fix lint code quality issue

416774f

Fix ruff and vulture code quality issue

58f72b3

fix utests after merging

834ac7f

Code quality issues

f6d7904

Final code quality check

e43aa1b

change ci code quality as important for CI

eef7a21

Fix dev release CI

1557f9f

update release note format

9953476

Final release note fix

e00bd5c

fix instance_ids bugs matching for batch target format (targets=[bb_A, bb_B], batch_idx=[batch_idA, B, ..etc]

1258ebd

revert merging PR #185

7cf5c0c

Fix bug from reverted Merged Conflicts

861911c

Alexandru Rotaru and others added 27 commits June 14, 2026 18:57

Python version tests in workflows

1bbd971

remove useless certs

67f90a5

update gitignore and add debug print exc

8deb810

add docker initialization materials

e38c4f4

fix logger prints format and spam

72d682e

maximum resolution set to 360p by default

3b88c59

Merge branch 'main' into dev

06e2b28

add auto start in examples

6c3e1d5

Fix main process issues with windows and logs; and fix ultralytics integration issues with evaluation mode

59a45bc

Merge branch 'dev' of https://github.com/GrayboxTech/weightslab into dev

058f50b

remove useless dir

117bb4e

Disable the WatchDog; fix evaluate feature in nograd mode and no guards also; and remove dump model architecture for now

ff662f2

Fix code quality issue

0bee476

Merge branch 'main' into dev

695dc5a

Fix UL issue with custom EMA and evaluation feature

7b06e0f

Merge branch 'dev' of https://github.com/GrayboxTech/weightslab into dev

6fff830

Merge branch 'main' into dev

cfdbd3f

Allow to run CI from custom branch with correct trigger flag in commit

58ce5ee

Merge branch 'dev' of https://github.com/GrayboxTech/weightslab into dev

093f646

Add agent memory to repo for user

997e914

fix prediction normalization issues

7698e20

Fix ledgered parameters issues

ee9d6e6

Fix new exp. hash after pause 4 the first time

61bc12a

[skip ci] fix parameters wrapping in examples

eca9e25

Merge branch 'main' into dev

ceef8c3

guillaume-byte merged commit 9f4e47a into main Jun 18, 2026
6 checks passed

guillaume-byte changed the title ~~v1.2.5~~ v1.2.6 Jun 18, 2026

guillaume-byte temporarily deployed to github-pages June 18, 2026 16:10 — with GitHub Actions Inactive


v1.2.6 #208
guillaume-byte merged 99 commits into main from dev

guillaume-byte commented Jun 18, 2026


2 participants
