Support sample-level task_type for multimodal models (cls/seg/det across 2D/3D images and point clouds) #198

Summary

Allow a single experiment, model, and dataset to handle mixed sample modalities and task types by making task_type a sample-level property. The model/experiment should accept samples of these input modalities: 2D image, 3D image (volumetric), 2D point cloud (projected / organized), and 3D point cloud. For each sample it should produce any combination of outputs: classification (cls), segmentation (seg), and detection (det). All modalities and tasks must be manageable in one experiment and one dataset.

Motivation

- task_type is currently often treated as a dataset- or experiment-level property, which prevents mixing modalities and tasks inside the same run.
- We want a flexible training/evaluation pipeline and model that can learn from and output different task types per sample (e.g., some samples with cls+seg, others with image_det, others with ptCloud_det).
- This simplifies multi-task, multi-modal research workflows and supports unified logging/metrics for experiments that combine 2D/3D and image/point-cloud data.

Proposal

1) Sample schema
- Add/standardize a sample-level field `task_type` (string or enum) describing the sample's required outputs. Examples: "image_cls", "image_seg", "image_det", "vol3d_cls", "vol3d_seg", "vol3d_det", "pc2d_cls", "pc2d_seg", "pc2d_det", "pc3d_cls", "pc3d_seg", "pc3d_det".
- Add a `modalities` field that lists available inputs in the sample (e.g., ["image2d"], ["depth_vol"], ["pc3d"], ["image2d","pc3d"]).
- Add a `targets` object that contains the ground-truth data keyed by task/subtype (e.g., targets.class, targets.segmentation, targets.bboxes, targets.pc_bboxes). Each target is optional and should be present only when the sample's task_type requires it.
- Allow `task_type` to be set manually or inferred automatically from the presence of specific target fields (auto-detection fallback).
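
As a rough illustration of the schema above, here is a minimal sketch in Python. The `Sample` dataclass is hypothetical, the enum lists only a subset of the example strings, and a `None` task_type is assumed to mean "infer at ingestion"; none of this is an agreed design.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class TaskType(str, Enum):
    # Subset of the example strings from the proposal; the final set is open.
    IMAGE_CLS = "image_cls"
    IMAGE_SEG = "image_seg"
    IMAGE_DET = "image_det"
    VOL3D_SEG = "vol3d_seg"
    PC3D_DET = "pc3d_det"


@dataclass
class Sample:
    # Hypothetical container; field names follow the proposal text.
    modalities: List[str]
    targets: Dict[str, Any] = field(default_factory=dict)
    task_type: Optional[TaskType] = None  # None: leave it to auto-detection


# Parsing the string through the enum rejects unknown task types early.
sample = Sample(
    modalities=["image2d"],
    targets={"class": 3},
    task_type=TaskType("image_cls"),
)
```

Subclassing `str` keeps the enum members comparable to the raw strings stored in existing metadata, which may ease migration.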

2) Dataset format & ingestion
- When ingesting datasets, prefer reading `task_type` from the sample metadata when present.
- If absent, auto-detect `task_type` with a deterministic rule based on present `targets` fields (e.g., presence of "masks" => image_seg; presence of "bboxes" with 2D image input => image_det; presence of point-cloud bbox fields => pc3d_det).
- Maintain backward compatibility: if a dataset-level task_type exists, use it as the default for samples that do not specify `task_type`.
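
The resolution order could be sketched as below. The function names are hypothetical, samples are assumed to be plain dicts, and the precedence (explicit value, then inference, then dataset-level default) is one possible reading of the rules above rather than a settled decision. The target keys `pc_bboxes`, `bboxes`, and `masks` are the ones used as examples in this issue.

```python
from typing import Any, Dict, List, Optional


def infer_task_type(modalities: List[str], targets: Dict[str, Any]) -> Optional[str]:
    """Deterministic fallback: rules are checked in a fixed order."""
    if "pc_bboxes" in targets:
        return "pc3d_det"
    if "bboxes" in targets and "image2d" in modalities:
        return "image_det"
    if "masks" in targets:
        return "image_seg"
    return None  # no rule matched


def resolve_task_type(sample: Dict[str, Any],
                      dataset_default: Optional[str] = None) -> str:
    # 1) Sample metadata wins when present.
    explicit = sample.get("task_type")
    if explicit:
        return explicit
    # 2) Otherwise try to infer from the target fields (assumed precedence).
    inferred = infer_task_type(sample.get("modalities", []),
                               sample.get("targets", {}))
    if inferred:
        return inferred
    # 3) Fall back to the legacy dataset-level value.
    if dataset_default:
        return dataset_default
    raise ValueError("cannot determine task_type for sample")
```

Raising when nothing matches, instead of guessing, keeps the inference deterministic and surfaces unlabeled samples during ingestion.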

3) Experiment / model API
- The experiment's batch sampler must be able to create mixed batches (or stacked micro-batches) in which each sample carries its own task_type and targets.
- The model's forward pass must support conditional heads activated per sample (i.e., per batch element), or produce multi-head outputs whose losses are masked by each sample's task_type.
- The training loop should compute losses only for the outputs relevant to each sample's task_type and aggregate them consistently (e.g., as a weighted sum; separate optimizers per task are optional).
- Evaluation/metrics pipeline should compute per-task and aggregate metrics and allow filtering by modality/task_type.
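
The masked-loss option can be sketched in plain Python, with lists standing in for tensors. Everything here is an assumption for illustration: the head names (`cls`, `seg`, `det`) are derived from the task_type suffix, each sample activates exactly one head, and the aggregate is a weighted sum of per-head masked means.

```python
from typing import Dict, List


def active_heads(task_type: str) -> List[str]:
    # Hypothetical convention: "image_seg" -> ["seg"], "pc3d_det" -> ["det"].
    return [task_type.rsplit("_", 1)[-1]]


def aggregate_loss(task_types: List[str],
                   per_head_losses: Dict[str, List[float]],
                   weights: Dict[str, float]) -> float:
    """per_head_losses[head][i] is the loss of sample i under that head."""
    total = 0.0
    for head, losses in per_head_losses.items():
        # 1.0 where the sample's task_type needs this head, else 0.0.
        mask = [1.0 if head in active_heads(t) else 0.0 for t in task_types]
        n_active = sum(mask)
        if n_active == 0:
            continue  # no sample in this batch uses this head
        masked_mean = sum(m * l for m, l in zip(mask, losses)) / n_active
        total += weights.get(head, 1.0) * masked_mean
    return total


task_types = ["image_cls", "image_seg", "pc3d_det", "vol3d_seg"]
per_head_losses = {
    "cls": [2.0, 9.0, 9.0, 9.0],  # only sample 0 counts
    "seg": [9.0, 1.0, 9.0, 3.0],  # samples 1 and 3 count
    "det": [9.0, 9.0, 4.0, 9.0],  # only sample 2 counts
}
total = aggregate_loss(task_types, per_head_losses,
                       {"cls": 1.0, "seg": 0.5, "det": 1.0})
```

Dividing by the number of active samples per head, not by the batch size, keeps a head's loss scale independent of how many samples of other task types share the batch.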

4) Segmentation rendering from task_type
- When rendering segmentation annotations (for visualization or loss), use the sample's `task_type` to select which renderer to use (image masks vs. voxel masks vs. point-cloud labels).
- Provide a shared renderer interface that maps task_type to a rendering function.
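
One way to realize the shared renderer interface is a small registry keyed by task_type. This is a sketch only: `register_renderer` and `render_segmentation` are hypothetical names, and the renderers return placeholder strings where real ones would draw masks.

```python
from typing import Any, Callable, Dict

Renderer = Callable[[Any], str]
_RENDERERS: Dict[str, Renderer] = {}


def register_renderer(task_type: str) -> Callable[[Renderer], Renderer]:
    """Decorator that binds a rendering function to a task_type string."""
    def decorator(fn: Renderer) -> Renderer:
        _RENDERERS[task_type] = fn
        return fn
    return decorator


@register_renderer("image_seg")
def render_image_mask(target: Any) -> str:
    return f"2D mask overlay ({len(target)} rows)"


@register_renderer("vol3d_seg")
def render_voxel_mask(target: Any) -> str:
    return f"voxel mask ({len(target)} slices)"


@register_renderer("pc3d_seg")
def render_point_labels(target: Any) -> str:
    return f"point labels ({len(target)} points)"


def render_segmentation(task_type: str, target: Any) -> str:
    # Dispatch on the sample's task_type; fail loudly for unsupported types.
    try:
        renderer = _RENDERERS[task_type]
    except KeyError:
        raise ValueError(
            f"no segmentation renderer for task_type={task_type!r}") from None
    return renderer(target)
```

A registry lets new modalities add a renderer without touching the dispatch code.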

5) Detection (image and point cloud)
- Support both image detection (image_det) and point-cloud detection (ptCloud_det) targets. Use a unified bbox/annotation schema with modality-specific coordinate frames.
- Document coordinate transforms expected by evaluation modules.
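
A unified box record might look like the sketch below. The `Box` class, its fields, and the frame names (`image_px`, `lidar_m`) are hypothetical. Boxes are assumed axis-aligned and stored as center plus size; rotation, which 3D detection usually needs, is deliberately left out.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    center: Tuple[float, ...]  # (x, y) for 2D, (x, y, z) for 3D
    size: Tuple[float, ...]    # extent along each axis, same length as center
    label: int
    frame: str                 # coordinate frame, e.g. "image_px" or "lidar_m"

    def __post_init__(self) -> None:
        if len(self.center) != len(self.size) or len(self.center) not in (2, 3):
            raise ValueError("center and size must both be 2D or both be 3D")

    @property
    def ndim(self) -> int:
        return len(self.center)

    def min_max(self) -> Tuple[Tuple[float, ...], Tuple[float, ...]]:
        """Return the (min corner, max corner) of the axis-aligned box."""
        lo = tuple(c - s / 2 for c, s in zip(self.center, self.size))
        hi = tuple(c + s / 2 for c, s in zip(self.center, self.size))
        return lo, hi


image_box = Box(center=(10.0, 20.0), size=(4.0, 6.0), label=1, frame="image_px")
cloud_box = Box(center=(1.0, 2.0, 0.5), size=(2.0, 4.0, 1.0), label=7, frame="lidar_m")
```

Storing the frame on each box makes the coordinate transform expected by an evaluator an explicit, checkable property instead of an implicit convention.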

Acceptance criteria

- [ ] Dataset samples can carry `task_type` values that inform downstream ingestion and training pipelines.
- [ ] Automatic task_type inference works when sample `task_type` is missing and follows deterministic rules.
- [ ] Model/experiment supports mixed batches with per-sample task masking and computes per-sample losses correctly.
- [ ] Visualizer and evaluation support segmentation rendering and detection evaluation for each modality based on `task_type`.
- [ ] Backward compatibility: existing dataset-level task_type continues to work as a default.

Implementation notes / suggestions

- Add enums/constants for supported task_types in a central place (e.g., weightslab.datasets.tasks.TaskType).
- Make `targets` a dict with well-known keys: `class`, `mask_image`, `mask_volume`, `bboxes_2d`, `bboxes_3d`, `pc_labels`, etc. Keep values optional.
- For training, add a wrapper Batch class that exposes per-sample masks indicating which loss terms should be computed for each sample.
- Consider micro-batching by task_type when mixed batches cause inefficient compute (e.g., different heads require different preprocessing).
- Provide utilities to migrate legacy datasets: a migration script that reads dataset metadata and emits sample-level `task_type` when safe.
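
The micro-batching suggestion could be prototyped as follows. This is a sketch under assumptions: samples are dicts with a resolved `task_type` key, and `micro_batches` is a hypothetical helper, not an existing sampler API.

```python
from collections import defaultdict
from typing import Any, Dict, Iterator, List, Tuple


def micro_batches(samples: List[Dict[str, Any]],
                  max_size: int) -> Iterator[Tuple[str, List[Dict[str, Any]]]]:
    """Split a mixed batch into homogeneous micro-batches of at most max_size."""
    groups: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for s in samples:
        groups[s["task_type"]].append(s)
    # Dicts preserve insertion order, so groups come out in first-seen order.
    for task_type, group in groups.items():
        for start in range(0, len(group), max_size):
            yield task_type, group[start:start + max_size]


mixed = [
    {"id": 0, "task_type": "image_cls"},
    {"id": 1, "task_type": "image_seg"},
    {"id": 2, "task_type": "image_cls"},
    {"id": 3, "task_type": "image_cls"},
    {"id": 4, "task_type": "pc3d_det"},
]
plan = [(t, [s["id"] for s in b]) for t, b in micro_batches(mixed, max_size=2)]
```

Each micro-batch then needs only one head and one preprocessing path, at the cost of more forward passes per step.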

Risks / open questions

- Mixed-modality batches may complicate GPU memory patterns and cause inefficiencies; we might need to allow batching by dominant modality/task_type.
- Standardizing target field names requires careful migration and clear docs to avoid dataset errors.
- Deciding the canonical set of task_type strings needs agreement.

Next steps

1. Design the sample JSON schema and add it to the repository docs (with examples for each modality).
2. Add task_type enum and target keys as constants in the codebase.
3. Implement dataset ingestion changes and automatic inference rules.
4. Update model training loop and evaluators to support per-sample masking.
5. Add tests (unit and integration) and a migration tool for legacy datasets.

Notes from user

Task type management should be sample related, i.e., seg rendering from tasktype, cls, image_det, ptCloud_det, all from samples (manual or auto). The goal is one experiment, one model, one dataset that can include 2D image, 3D image, 2D point cloud, 3D point cloud samples and produce cls/seg/det as appropriate per sample.
