Merged
12 changes: 8 additions & 4 deletions CHANGELOG.md
@@ -11,6 +11,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- `Compute.filteredSum(filterColumn, predicate, aggColumn)` fuses a filter and a sum into a single scan — a row folds into the total only when the predicate selects it (a null filter row is excluded) and the aggregate value is non-null — with no intermediate selection bitmap. It matches a hand-written fused loop and is ~1.5× faster than the two-pass `filter` + `sum`. ([57d2225b](https://github.com/dfa1/vortex-java/commit/57d2225b))
- `Compute.filteredAggregate(chunk, filter, aggColumn)` fuses a whole multi-column `RowFilter` (an n-ary `AND` of column-bound predicate leaves) and folds the selected rows' `SUM`/`MIN`/`MAX`/non-null count over an aggregate column in a single pass — the multi-column counterpart of `filteredSum`, and the row-level kernel behind the Calcite boundary-chunk aggregate push-down. A `null` aggregate column counts selected rows only (`COUNT(*)`). ([2ba54888](https://github.com/dfa1/vortex-java/commit/2ba54888))
- `core.model.LayoutId` — typed layout identity with the same sealed shape as `EncodingId` (`WellKnown` constants plus `Custom`; layouts are runtime-pluggable in the reference implementation). The reader now recognizes `vortex.zoned`, the current canonical zone-map layout id in the Rust reference, alongside the legacy `vortex.stats` alias it keeps writing — files from current Rust writers scan and prune correctly. ([7df3a0db](https://github.com/dfa1/vortex-java/commit/7df3a0db))
- Layout decode is pluggable: `LayoutDecoder` + `LayoutRegistry` (`reader.layout`) mirror the encoding registry — `LayoutRegistry.builder().registerDefaults().register(custom).build()` passed to the new `VortexReader.open(path, readRegistry, layoutRegistry)` / `VortexHttpReader` overloads dispatches every layout decode, container children included, through the registry. Programmatic registration only (no service file); unknown layouts fail loudly. Zone-map pruning and filtered scans recognize the built-in layouts only. ([fc488d04](https://github.com/dfa1/vortex-java/commit/fc488d04), [dd196f17](https://github.com/dfa1/vortex-java/commit/dd196f17))
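
The fused filter-and-sum described in the `Compute.filteredSum` entry can be illustrated with a minimal sketch. This is not the vortex-java implementation: `FusedSumSketch`, `twoPassSum`, and `fusedSum` are hypothetical names, and columns are modeled as boxed `Long[]` arrays of equal length in which `null` marks a missing value.

```java
import java.util.function.LongPredicate;

/** Hypothetical stand-in for illustration only; not the vortex-java API. */
final class FusedSumSketch {

    /** Two-pass baseline: materialize a selection mask, then sum the selected rows. */
    static long twoPassSum(Long[] filterColumn, LongPredicate predicate, Long[] aggColumn) {
        boolean[] selected = new boolean[filterColumn.length];
        for (int i = 0; i < filterColumn.length; i++) {
            // A null filter row is never selected.
            selected[i] = filterColumn[i] != null && predicate.test(filterColumn[i]);
        }
        long total = 0;
        for (int i = 0; i < aggColumn.length; i++) {
            if (selected[i] && aggColumn[i] != null) {
                total += aggColumn[i];
            }
        }
        return total;
    }

    /** Fused variant: one scan, no intermediate mask. */
    static long fusedSum(Long[] filterColumn, LongPredicate predicate, Long[] aggColumn) {
        long total = 0;
        for (int i = 0; i < filterColumn.length; i++) {
            Long f = filterColumn[i];
            Long v = aggColumn[i];
            // Fold the row only when the predicate selects it and the value is non-null.
            if (f != null && predicate.test(f) && v != null) {
                total += v;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Long[] category = {7L, null, 7L, 3L, 7L};
        Long[] price = {100L, 200L, null, 400L, 50L};
        // Rows 0 and 4 qualify; row 1 has a null filter value, row 2 a null price.
        System.out.println(fusedSum(category, c -> c == 7L, price));   // 150
        System.out.println(twoPassSum(category, c -> c == 7L, price)); // 150
    }
}
```

Both variants return the same total; the fused loop simply avoids allocating and re-reading the selection mask, which is the saving the changelog entry attributes to the single scan.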

### Changed

@@ -19,6 +21,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- A multi-column `AND` filter no longer forfeits the dictionary lane: the dict-encoded leaf drives the code scan and the remaining predicates are evaluated only on its matches — `SUM(…) WHERE category = 7 AND price > 500` over 100M rows drops from ~2.3 s to ~200 ms (~11×). ([12e13501](https://github.com/dfa1/vortex-java/commit/12e13501))
- `core.model.EncodingId` is now a sealed interface: the spec constants live in the nested `WellKnown` enum (re-exported, so `EncodingId.VORTEX_FOO` call sites compile unchanged) and `Custom` wraps any other wire string, which for the first time lets third-party `EncodingDecoder`/`EncodingEncoder` implementations declare ids outside the spec set. `parse` is total over non-blank ids — an unknown id yields a typed `Custom` instead of an empty `Optional`. ([ea88a91b](https://github.com/dfa1/vortex-java/commit/ea88a91b))
- `reader.decode.ArrayNode` is a single record carrying the typed `EncodingId`; the `KnownArrayNode`/`UnknownArrayNode` split and the `ArrayNode.of` factory are gone. Decode dispatch, the `allowUnknown` passthrough, and error messages are unchanged. A crafted file with a blank encoding id now fails as `VortexException` instead of escaping as `IllegalArgumentException`. ([21810d7e](https://github.com/dfa1/vortex-java/commit/21810d7e))
- `Layout` and `ZonedStatsSchema` moved to the new `reader.layout` package, and `Layout`'s misnamed `String encodingId` component is now `LayoutId layoutId`. Unknown layouts still fail loudly, now with a typed id in the error. ([7df3a0db](https://github.com/dfa1/vortex-java/commit/7df3a0db), [b08ace79](https://github.com/dfa1/vortex-java/commit/b08ace79))
- `UnknownArray.encodingId` is a typed `EncodingId` instead of a raw string — a `Custom`, or a `WellKnown` whose decoder is not registered. ([7588aa31](https://github.com/dfa1/vortex-java/commit/7588aa31))
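
The sealed-id shape described for `EncodingId` (and mirrored by `LayoutId`) might look roughly like the sketch below. Everything here is an assumption made for illustration: the `Id` type, its two constants, and the choice of `IllegalArgumentException` for blank input are hypothetical and are not taken from the library source.

```java
/** Hypothetical model of a sealed id type; not the vortex-java source. */
final class EncodingIdSketch {

    sealed interface Id permits Id.WellKnown, Id.Custom {
        String wireId();

        /** Spec constants live in a nested enum. */
        enum WellKnown implements Id {
            VORTEX_CONSTANT("vortex.constant"),
            VORTEX_DICT("vortex.dict");

            private final String wireId;

            WellKnown(String wireId) {
                this.wireId = wireId;
            }

            @Override
            public String wireId() {
                return wireId;
            }
        }

        /** Any other wire string is wrapped rather than rejected. */
        record Custom(String wireId) implements Id {}

        // Re-exported constants so that Id.VORTEX_DICT call sites keep compiling.
        Id VORTEX_CONSTANT = WellKnown.VORTEX_CONSTANT;
        Id VORTEX_DICT = WellKnown.VORTEX_DICT;

        /** Total over non-blank ids: an unknown id yields a Custom, never an empty result. */
        static Id parse(String raw) {
            if (raw == null || raw.isBlank()) {
                // Assumption: the sketch rejects blank ids with a plain runtime exception.
                throw new IllegalArgumentException("blank id");
            }
            for (WellKnown known : WellKnown.values()) {
                if (known.wireId().equals(raw)) {
                    return known;
                }
            }
            return new Custom(raw);
        }
    }

    static String describe(Id id) {
        if (id instanceof Id.WellKnown known) {
            return "well-known " + known.wireId();
        }
        if (id instanceof Id.Custom custom) {
            return "custom " + custom.wireId();
        }
        throw new AssertionError("unreachable: Id is sealed");
    }

    public static void main(String[] args) {
        System.out.println(describe(Id.parse("vortex.dict")));
        System.out.println(describe(Id.parse("acme.geo"))); // "acme.geo" is a made-up third-party id
    }
}
```

The point of the shape is that callers always receive a typed value: code that previously had to handle an empty `Optional` for unknown ids can instead branch on `WellKnown` versus `Custom`.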

## [0.11.0] — 2026-06-28

@@ -166,7 +170,7 @@ A hardening release: no new file-format capability, but a large step up in verif

- Zero-warning rule: `-Xlint:all -Werror` across all modules. The `classfile` lint (which only flags missing annotation class files inside third-party Arrow bytecode) is scoped off in the two Arrow-using modules only. ([dab467e5](https://github.com/dfa1/vortex-java/commit/dab467e5), [43f6f840](https://github.com/dfa1/vortex-java/commit/43f6f840))
- Mutation testing (PIT): opt-in `pitest` profiles in core, reader, and writer, scoped to the bounds/parse classes (`IoBounds`, `PTypeIO`, `WriteRegistry`, `ChunkImpl`, …), with common config hoisted into the parent POM. ([46904b24](https://github.com/dfa1/vortex-java/commit/46904b24), [ed8c98a1](https://github.com/dfa1/vortex-java/commit/ed8c98a1), [1200c76b](https://github.com/dfa1/vortex-java/commit/1200c76b), [840cc46a](https://github.com/dfa1/vortex-java/commit/840cc46a))
- SonarCloud: generated `fbs/` and `proto/` sources excluded from analysis (machine output, not hand-maintained); the deliberate per-width SIMD-loop duplication is documented in [ADR 0005](docs/adr/0005-vector-api-adoption.md) rather than refactored away. Code smells dropped 857→394; coverage ~81%, all ratings A, zero bugs/vulnerabilities. ([6c591293](https://github.com/dfa1/vortex-java/commit/6c591293))
- SonarCloud: generated `fbs/` and `proto/` sources excluded from analysis (machine output, not hand-maintained); the deliberate per-width SIMD-loop duplication is documented in [ADR 0005](adr/0005-vector-api-adoption.md) rather than refactored away. Code smells dropped 857→394; coverage ~81%, all ratings A, zero bugs/vulnerabilities. ([6c591293](https://github.com/dfa1/vortex-java/commit/6c591293))

### Tests

@@ -181,7 +185,7 @@ Read and write Vortex Variant (semi-structured, JSON-shaped) columns from Java.

### Added

- Writer: `vortex.variant` encoder. Encodes a variant column as the canonical `vortex.variant` container over `core_storage` — an all-equal column becomes a single `vortex.constant`, a row-varying column a `vortex.chunked` of per-run constants — with an optional row-aligned typed `shredded` child recorded in `VariantMetadata.shredded_dtype`. Input is `VariantData(List<Scalar>)` with `.constant(n, v)` / `.shredded(...)` factories. Java↔Rust (JNI) round-trip verified for constant, row-varying, and shredded columns. Scalar values only — arbitrary nested objects need `vortex.parquet.variant` (deferred, [ADR 0014](docs/adr/0014-variant-encoding-strategy.md)). ([35da529d](https://github.com/dfa1/vortex-java/commit/35da529d), [e4e44980](https://github.com/dfa1/vortex-java/commit/e4e44980), [4566dca0](https://github.com/dfa1/vortex-java/commit/4566dca0))
- Writer: `vortex.variant` encoder. Encodes a variant column as the canonical `vortex.variant` container over `core_storage` — an all-equal column becomes a single `vortex.constant`, a row-varying column a `vortex.chunked` of per-run constants — with an optional row-aligned typed `shredded` child recorded in `VariantMetadata.shredded_dtype`. Input is `VariantData(List<Scalar>)` with `.constant(n, v)` / `.shredded(...)` factories. Java↔Rust (JNI) round-trip verified for constant, row-varying, and shredded columns. Scalar values only — arbitrary nested objects need `vortex.parquet.variant` (deferred, [ADR 0014](adr/0014-variant-encoding-strategy.md)). ([35da529d](https://github.com/dfa1/vortex-java/commit/35da529d), [e4e44980](https://github.com/dfa1/vortex-java/commit/e4e44980), [4566dca0](https://github.com/dfa1/vortex-java/commit/4566dca0))
- Reader: variant columns now decode Java-side. `ConstantEncodingDecoder` and `ChunkedEncodingDecoder` handle `DType.Variant` (materializing the inner-typed array); `VariantEncodingDecoder` wraps the result as `VariantArray`, exposing `coreStorage()` and `shredded()`. ([76e4c741](https://github.com/dfa1/vortex-java/commit/76e4c741), [4566dca0](https://github.com/dfa1/vortex-java/commit/4566dca0))

### Security
@@ -196,7 +200,7 @@ Read and write Vortex Variant (semi-structured, JSON-shaped) columns from Java.

### Changed

- Decode shape: transform encodings now decode **lazy-only**. The eager `Materialized*Array` fallbacks were removed from `vortex.zigzag` (all PTypes + broadcast, [cd59fefa](https://github.com/dfa1/vortex-java/commit/cd59fefa)), `fastlanes.for` (all integer PTypes, [d7953e1f](https://github.com/dfa1/vortex-java/commit/d7953e1f)), `vortex.alp` (broadcast-without-patches, [deab8067](https://github.com/dfa1/vortex-java/commit/deab8067)), `vortex.constant` (Decimal → `LazyConstantDecimalArray`, [a6a9611e](https://github.com/dfa1/vortex-java/commit/a6a9611e)), `vortex.runend` (Bool → `LazyRunEndBoolArray`, [0bbcb81f](https://github.com/dfa1/vortex-java/commit/0bbcb81f)), `vortex.sparse` (Bool → `LazySparseBoolArray`, [db2e955b](https://github.com/dfa1/vortex-java/commit/db2e955b)), and `fastlanes.rle` (validity → `OffsetBoolArray`, empty → `LazyConstantXxxArray`, [5e83a5c3](https://github.com/dfa1/vortex-java/commit/5e83a5c3)). Decompression encodings (`bitpacked`, `pco`, `zstd`, `fsst`, `delta`, `patched`), the primitive base, the `vortex.dict` encoding-level path, and the `vortex.alp` patches path stay Materialized by design. See [ADR 0015](docs/adr/0015-drop-materialized-fallbacks.md).
- Decode shape: transform encodings now decode **lazy-only**. The eager `Materialized*Array` fallbacks were removed from `vortex.zigzag` (all PTypes + broadcast, [cd59fefa](https://github.com/dfa1/vortex-java/commit/cd59fefa)), `fastlanes.for` (all integer PTypes, [d7953e1f](https://github.com/dfa1/vortex-java/commit/d7953e1f)), `vortex.alp` (broadcast-without-patches, [deab8067](https://github.com/dfa1/vortex-java/commit/deab8067)), `vortex.constant` (Decimal → `LazyConstantDecimalArray`, [a6a9611e](https://github.com/dfa1/vortex-java/commit/a6a9611e)), `vortex.runend` (Bool → `LazyRunEndBoolArray`, [0bbcb81f](https://github.com/dfa1/vortex-java/commit/0bbcb81f)), `vortex.sparse` (Bool → `LazySparseBoolArray`, [db2e955b](https://github.com/dfa1/vortex-java/commit/db2e955b)), and `fastlanes.rle` (validity → `OffsetBoolArray`, empty → `LazyConstantXxxArray`, [5e83a5c3](https://github.com/dfa1/vortex-java/commit/5e83a5c3)). Decompression encodings (`bitpacked`, `pco`, `zstd`, `fsst`, `delta`, `patched`), the primitive base, the `vortex.dict` encoding-level path, and the `vortex.alp` patches path stay Materialized by design. See [ADR 0015](adr/0015-drop-materialized-fallbacks.md).
- **Breaking — sealed `Array` permits changed.** `DecimalArray` is now a `non-sealed` family interface (decimal arrays moved from `implements Array` to `implements DecimalArray`), so decimal joins the per-dtype family layer. Downstream exhaustive `switch` over `Array` must add a `case DecimalArray`. ([a6a9611e](https://github.com/dfa1/vortex-java/commit/a6a9611e))
- **Breaking — `Array` API.** `Array.truncate(rows)` renamed to `Array.limited(rows)` and made an abstract operation implemented by every array (composites slice their children); raw-segment access moved off the `ArraySegments` utility onto `Array.materialize(SegmentAllocator)` and `Array.segmentIfPresent()`. ([87ab65e2](https://github.com/dfa1/vortex-java/commit/87ab65e2), [4d9ac1f8](https://github.com/dfa1/vortex-java/commit/4d9ac1f8), [332b067e](https://github.com/dfa1/vortex-java/commit/332b067e), [32a35e03](https://github.com/dfa1/vortex-java/commit/32a35e03))
- CSV import reports progress every 10K rows instead of per-chunk. ([07a056e7](https://github.com/dfa1/vortex-java/commit/07a056e7))
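
The breaking sealed-permits change listed above (a `non-sealed` `DecimalArray` family under the sealed `Array`) affects exhaustive switches downstream. The sketch below shows the general Java mechanism only; `PrimitiveArray` and `ConstantDecimal` are hypothetical stand-ins, the real `Array` hierarchy has many more members, and pattern matching for `switch` assumes Java 21 or later.

```java
import java.math.BigDecimal;

/** Hypothetical miniature of a sealed hierarchy with a non-sealed family interface. */
final class SealedFamilySketch {

    sealed interface Array permits PrimitiveArray, DecimalArray {}

    record PrimitiveArray(long[] values) implements Array {}

    /** Family interface: open for further implementations, but still a permitted subtype. */
    non-sealed interface DecimalArray extends Array {}

    record ConstantDecimal(BigDecimal value, int rows) implements DecimalArray {}

    static String family(Array array) {
        // Without the DecimalArray case this switch would not compile,
        // because the compiler checks exhaustiveness against the permits list.
        return switch (array) {
            case PrimitiveArray p -> "primitive with " + p.values().length + " rows";
            case DecimalArray d -> "decimal";
        };
    }

    public static void main(String[] args) {
        System.out.println(family(new PrimitiveArray(new long[] {1, 2, 3})));
        System.out.println(family(new ConstantDecimal(new BigDecimal("1.50"), 3)));
    }
}
```

One case on the family interface covers every decimal implementation, which is why a single added `case DecimalArray` is enough after the change.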
@@ -207,7 +211,7 @@ Read and write Vortex Variant (semi-structured, JSON-shaped) columns from Java.

### Documentation

- [ADR 0016](docs/adr/0016-vortex-arrow-bridge.md): captures `vortex-arrow` bridge interop options (separate module / Arrow C-Data / none); deferred until a concrete downstream need. ([a6126f29](https://github.com/dfa1/vortex-java/commit/a6126f29))
- [ADR 0016](adr/0016-vortex-arrow-bridge.md): captures `vortex-arrow` bridge interop options (separate module / Arrow C-Data / none); deferred until a concrete downstream need. ([a6126f29](https://github.com/dfa1/vortex-java/commit/a6126f29))

### Tests

17 changes: 11 additions & 6 deletions CLAUDE.md
@@ -23,17 +23,18 @@ Benchmark classes follow this: `JavaVsJni{Read,Write,Filter}Benchmark`,

```
core — everything lives under `io.github.dfa1.vortex.core.*`:
core.model DType, PType, TimeUnit, EncodingId, ExtensionId, TimeDtype, TimestampDtype
core.model DType, PType, TimeUnit, EncodingId, LayoutId, ExtensionId, TimeDtype, TimestampDtype
core.io IoBounds, PTypeIO, VortexFormat
core.error VortexException
core.compute FastLanes, PrimitiveArrays
core.fbs / core.proto — generated wire codecs + their runtimes
reader — VortexReader, VortexHttpReader, VortexHandle, ReadRegistry, Chunk, ArrayStats,
ScanOptions, RowFilter; file internals (Footer, Layout, Trailer,
PostscriptParser, …)
ScanOptions, RowFilter; file internals (Footer, Trailer, PostscriptParser, …)
reader.array — Array + all subtypes (decode outputs)
reader.decode — EncodingDecoder, DecodeContext, ArrayNode + *EncodingDecoder impls
reader.extension — ExtensionDecoder + Date/Time/Timestamp/Uuid impls
reader.layout — Layout, LayoutDecoder, LayoutDecodeContext, LayoutRegistry
+ built-in *LayoutDecoder impls, ZonedStatsSchema
writer — VortexWriter, WriteRegistry, WriteOptions, ExtensionEncoder
writer.encode — EncodingEncoder, EncodeContext, NullableData + *EncodingEncoder impls,
extension encoders
@@ -194,9 +195,13 @@ in the Rust source for the exact schema, then implement from spec.
not add variants. Use `new DType.Extension("ip.address", new DType.Primitive(PType.I32, false),
null, false)` and register decoders/encoders on the registries (or `ServiceLoader<ExtensionEncoder>`).
Mirrors Rust (`vortex.date`, `vortex.uuid`, …). No SPI for DType variants planned.
- **Layout is a fixed set, no SPI.** `ScanIterator.decodeLayout()` dispatches the known IDs
(flat/chunked/zoned/struct/dict) and throws otherwise. Keep the fixed set; revisit only for a
concrete downstream case unaddressable by a different flat-segment encoding.
- **Layout decode is pluggable via `LayoutDecoder` + `LayoutRegistry`** (`reader.layout`) — the
Rust reference registers layouts at runtime, so ours are open too. Builder-registered only
(`LayoutRegistry.builder().registerDefaults().register(custom).build()`, pass to
`VortexReader.open(path, readRegistry, layoutRegistry)`) — **no service file**. Unknown layouts
fail loudly (`VortexException`, Rust default; no allowUnknown for layouts). Scope: the SPI covers
full-column subtree decode; zone-map pruning, filtered scans, and chunk planning recognize the
built-in layouts only.
- **Small public APIs.** Don't expose internals — when in doubt, leave it out or make it private.
- **POM deps** grouped with comments: `<!-- production -->` then `<!-- testing -->`, each with
project-internal (`io.github.dfa1.vortex:*`) deps first, then external. Omit empty sections.
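
The builder-registered, fail-loudly registry described in the layout bullet above can be sketched as follows. This is an illustration under stated assumptions, not the `reader.layout` API: the `Decoder` interface shape, `NamedDecoder`, the `String` return type, the use of `IllegalStateException`, and the `acme.tiled` id are all hypothetical. Only the `builder().registerDefaults().register(custom).build()` call chain and the `vortex.zoned` / `vortex.stats` ids come from the text.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical registry sketch; the real LayoutRegistry and LayoutDecoder signatures are not shown here. */
final class LayoutRegistrySketch {

    /** Assumed decoder shape, for illustration only. */
    interface Decoder {
        String layoutId();

        String decode(byte[] payload);
    }

    /** Trivial decoder that just reports which layout handled the payload. */
    record NamedDecoder(String layoutId) implements Decoder {
        @Override
        public String decode(byte[] payload) {
            return layoutId + "[" + payload.length + " bytes]";
        }
    }

    private final Map<String, Decoder> decoders;

    private LayoutRegistrySketch(Map<String, Decoder> decoders) {
        this.decoders = Map.copyOf(decoders); // immutable once built
    }

    static Builder builder() {
        return new Builder();
    }

    /** Dispatches by layout id; an unregistered id is an error, never a silent passthrough. */
    String decode(String layoutId, byte[] payload) {
        Decoder decoder = decoders.get(layoutId);
        if (decoder == null) {
            throw new IllegalStateException("unknown layout: " + layoutId);
        }
        return decoder.decode(payload);
    }

    static final class Builder {
        private final Map<String, Decoder> decoders = new HashMap<>();

        Builder registerDefaults() {
            // The sketch treats the two ids as independent entries.
            register(new NamedDecoder("vortex.zoned"));
            register(new NamedDecoder("vortex.stats"));
            return this;
        }

        Builder register(Decoder decoder) {
            decoders.put(decoder.layoutId(), decoder);
            return this;
        }

        LayoutRegistrySketch build() {
            return new LayoutRegistrySketch(decoders);
        }
    }

    public static void main(String[] args) {
        LayoutRegistrySketch registry = LayoutRegistrySketch.builder()
                .registerDefaults()
                .register(new NamedDecoder("acme.tiled")) // made-up custom layout
                .build();
        System.out.println(registry.decode("acme.tiled", new byte[4]));
    }
}
```

Keeping registration programmatic means the set of decoders is fixed at `build()` time and visible at the call site, which fits the "no service file" rule stated in the bullet.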
16 changes: 8 additions & 8 deletions TODO.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,10 +8,10 @@

## Performance

- [ ] **Benchmark publishing** — drop CI workflow, add `bench-publish` script; see [ADR-0006](docs/adr/0006-benchmark-publishing.md).
- [ ] **Benchmark publishing** — drop CI workflow, add `bench-publish` script; see [ADR-0006](adr/0006-benchmark-publishing.md).
- [ ] Performance tests must be peer-reviewed
- [ ] Run performance tests on other machines (I have access only to Apple M5)
- [ ] **Vector API adoption** — deferred; see [ADR-0005](docs/adr/0005-vector-api-adoption.md) for adoption criteria and candidate loops.
- [ ] **Vector API adoption** — deferred; see [ADR-0005](adr/0005-vector-api-adoption.md) for adoption criteria and candidate loops.

## Security

@@ -44,7 +44,7 @@ Per-encoding gotchas:

### Resource caps

- [ ] **Implement `ResourceLimits` + `ReadOptions`** — see [ADR-0004](docs/adr/0004-resource-caps-read-options.md) for design, defaults, and enforcement points. Also covers Pco page/bin caps.
- [ ] **Implement `ResourceLimits` + `ReadOptions`** — see [ADR-0004](adr/0004-resource-caps-read-options.md) for design, defaults, and enforcement points. Also covers Pco page/bin caps.

### Fuzz infrastructure

@@ -70,23 +70,23 @@ Per-encoding gotchas:

## Tooling

- [ ] Optional `vortex-arrow` bridge module for Arrow ecosystem interop — see [ADR-0016](docs/adr/0016-vortex-arrow-bridge.md)
- [ ] Optional `vortex-arrow` bridge module for Arrow ecosystem interop — see [ADR-0016](adr/0016-vortex-arrow-bridge.md)

## API

- [ ] **Error messages — structural sanitization of `VortexException`** —
Phase E (bounds typing via `IoBounds`) shipped; remaining is Phases A–D (the `Sanitize`
helper + `VortexError` catalog). See [ADR-0003](docs/adr/0003-vortex-exception-sanitization.md)
helper + `VortexError` catalog). See [ADR-0003](adr/0003-vortex-exception-sanitization.md)
for design and phasing.
- [ ] Use domain primitives (`UInt32`, `UInt64`, etc.) as value classes via Project Valhalla instead of raw `long`/`int`
- See [ADR-0008](docs/adr/0008-domain-primitives-unsigned-integers.md) and https://dfa1.github.io/articles/rethink-domain-primitives-with-valhalla
- See [ADR-0008](adr/0008-domain-primitives-unsigned-integers.md) and https://dfa1.github.io/articles/rethink-domain-primitives-with-valhalla
- Candidates: `PType` integer kinds, buffer offsets, row indices, byte lengths
- Goal: type-safety at zero cost (value class = no heap alloc, no boxing)

## Compute

- [ ] **Compute primitives — encoded-domain specialization & façade** — the remaining ADR-0013
follow-ups now the fused kernels have shipped. See [ADR-0013](docs/adr/0013-compute-primitives.md).
follow-ups now the fused kernels have shipped. See [ADR-0013](adr/0013-compute-primitives.md).
Done: §4 `Predicate`; §5 `RowFilter` unified over `Predicate`; §6 zone-map aggregate push-down in
both tiers — the whole-zone `ZoneReducer` fold wired into `VortexAggregatePushDownRule` (rewrites a
whole-table `MIN`/`MAX`/`COUNT`/`SUM`/`AVG` to a single-row `Values`, auto-registered over a bare
@@ -100,7 +100,7 @@ Per-encoding gotchas:
residual leaves tested per match). Multi-fork numbers: `fusedFilteredSumDict` 762 → 38 ms/op
≈ 20×; `fusedFilteredAggregateDict` 983 → 46 ms/op ≈ 22×; `fusedFilteredAggregateMulti`
(2-leaf `AND` × 2 aggregates) 2269 → 201 ms/op ≈ 11×.
Next: the columnar transducer façade — [ADR-0019](docs/adr/0019-columnar-transducer-facade.md)
Next: the columnar transducer façade — [ADR-0019](adr/0019-columnar-transducer-facade.md)
drafted (Proposed): declarative column-bound stages compiled to one fused pass; the remaining
measured lever is the multi-aggregate single scan (≈ 2×) plus composition ergonomics for the
Calcite boundary tier; review, then implement.
Expand Up @@ -353,6 +353,6 @@ of CI / integration-test fallout, plus reviewer time. Not a weekend.

- [PR #27 — `sec(parser): BoundedSegment + audit trail for untrusted asSlice`](https://github.com/dfa1/vortex-java/pull/27)
- [Phase 1–4 commits — BoundedSegment introduction and migration](https://github.com/dfa1/vortex-java/pull/27/commits)
- [SECURITY.md — the contract this work hardens](../../SECURITY.md)
- [CLAUDE.md — current "three touch-points" rule for adding an encoding](../../CLAUDE.md)
- [TODO.md — parser hardening backlog](../../TODO.md)
- [SECURITY.md — the contract this work hardens](../SECURITY.md)
- [CLAUDE.md — current "three touch-points" rule for adding an encoding](../CLAUDE.md)
- [TODO.md — parser hardening backlog](../TODO.md)