Skip to content

Commit b602c39

Browse files
cpsievertclaude
andauthored
test: validate writer output against Vega-Lite v6 JSON Schema (#175)
* chore: add jsonschema dev-dependency for schema validation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * chore: vendor Vega-Lite v6 JSON Schema Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: add assert_valid_vegalite() test helper with schema validation - Add jsonschema = "0.44" to dev-dependencies in src/Cargo.toml - Vendor the Vega-Lite v6 JSON Schema to src/writer/vegalite/schema/v6.json - Add sanitize_vegalite_schema() to rename non-URI-safe definition keys (angle brackets, parens, pipes, etc.) so jsonschema can compile the schema - Add LazyLock<Validator> VL_SCHEMA and assert_valid_vegalite() helper - Add test_schema_validation_catches_invalid_spec test that verifies a valid spec passes and an invalid mark value fails validation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * test: validate all writer specs against Vega-Lite v6 schema Add assert_valid_vegalite() calls to 14 writer tests that produce full specs. 4 tests fail schema validation, revealing real writer bugs: - 3 facet tests: "height":"container" invalid on FacetSpec in VL v6 - 1 legend test: "values" not a valid legend property in VL v6 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(vegalite): don't emit bin=\"binned\" for non-positional encodings In VL v6, the \"binned\" string value for the `bin` property is only valid for positional channels (x, y). Non-positional channels such as color, size, and opacity only accept a boolean, a BinParams object, or null. Emitting `\"bin\": \"binned\"` on a color encoding therefore fails schema validation. Fix: guard the `bin: \"binned\"` assignment with `!is_binned_legend` so it is only added for positional aesthetics. Binned legend aesthetics already get the correct treatment via a threshold scale type and do not need the `bin` property at all. Also remove a duplicate `jsonschema` entry in src/Cargo.toml that was introduced while retrofitting existing tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * style: fix rustfmt formatting in sanitize_vegalite_schema Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * perf: only rewrite $ref strings when sanitizing VL schema Replace map_strings (which allocated for every string in the 1.8MB schema) with rewrite_refs that only visits $ref values in objects. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * refactor: extract VEGALITE_VERSION constant for schema URL Add a VEGALITE_VERSION constant ("v6") used by the writer to construct the $schema URL. The vendored schema file at schema/v6.json must match. A test verifies the constant and writer URL stay in sync — when bumping versions, update the constant, rename the vendored file, and update the include_str! path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * test: verify vendored schema matches upstream on each test run Download the schema from the writer's $schema URL and compare it to the vendored copy. Skips gracefully when offline. This catches drift between the vendored file and upstream Vega-Lite releases. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: trim redundant comments Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * refactor: use full URL constant VEGALITE_SCHEMA instead of version fragment Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * refactor: use VEGALITE_SCHEMA constant directly in upstream check test Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Run cargo fmt * fix: close unclosed test function and add missing #[test] attribute The test_vendored_schema_matches_upstream function was missing its closing brace, and test_secondary_channels_have_no_disallowed_properties was missing its #[test] attribute (since it was accidentally nested inside the previous function). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 21786f4 commit b602c39

4 files changed

Lines changed: 32628 additions & 18 deletions

File tree

src/Cargo.toml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -71,7 +71,9 @@ tracing-subscriber = { workspace = true, optional = true }
7171
pyo3 = { workspace = true, optional = true }
7272

7373
[dev-dependencies]
74+
jsonschema = "0.44"
7475
proptest.workspace = true
76+
ureq = "3"
7577

7678
[features]
7779
default = ["duckdb", "sqlite", "vegalite", "ipc", "builtin-data"]

src/writer/vegalite/encoding.rs

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -848,8 +848,8 @@ fn build_column_encoding(
848848
"type": field_type,
849849
});
850850

851-
// For binned scales, add bin: "binned" for proper axis tick placement
852-
if is_binned {
851+
// bin: "binned" is only valid for positional channels in VL v6
852+
if is_binned && !is_binned_legend {
853853
encoding["bin"] = json!("binned");
854854
}
855855

src/writer/vegalite/mod.rs

Lines changed: 164 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -358,26 +358,30 @@ fn apply_faceting(
358358

359359
vl_spec["facet"] = facet_def;
360360

361-
// Move layer into spec (data reference stays at top level)
361+
// Move layer + width/height into inner spec (FacetSpec disallows
362+
// top-level width/height in VL v6)
362363
let mut spec_inner = json!({});
363364
if let Some(layer) = vl_spec.get("layer") {
364365
spec_inner["layer"] = layer.clone();
365366
}
367+
if let Some(w) = vl_spec.get("width").cloned() {
368+
spec_inner["width"] = w;
369+
}
370+
if let Some(h) = vl_spec.get("height").cloned() {
371+
spec_inner["height"] = h;
372+
}
366373

367374
vl_spec["spec"] = spec_inner;
368-
vl_spec.as_object_mut().unwrap().remove("layer");
375+
let obj = vl_spec.as_object_mut().unwrap();
376+
obj.remove("layer");
377+
obj.remove("width");
378+
obj.remove("height");
369379

370-
// Apply scale resolution
371380
apply_facet_scale_resolution(vl_spec, &facet.properties, coord_kind);
372-
373-
// Apply additional properties (columns for wrap)
374381
apply_facet_properties(vl_spec, &facet.properties, true);
375382
}
376383
FacetLayout::Grid { row: _, column: _ } => {
377384
let mut facet_spec = serde_json::Map::new();
378-
379-
// Row facet: use internal aesthetic column "facet1"
380-
// Vega-Lite requires "row" key in the facet object
381385
let row_aes_col = naming::aesthetic_column("facet1");
382386
if facet_df.column(&row_aes_col).is_ok() {
383387
let row_scale = scales.iter().find(|s| s.aesthetic == "facet1");
@@ -406,19 +410,25 @@ fn apply_faceting(
406410

407411
vl_spec["facet"] = Value::Object(facet_spec);
408412

409-
// Move layer into spec (data reference stays at top level)
413+
// Move layer + width/height into inner spec (same as wrap above)
410414
let mut spec_inner = json!({});
411415
if let Some(layer) = vl_spec.get("layer") {
412416
spec_inner["layer"] = layer.clone();
413417
}
418+
if let Some(w) = vl_spec.get("width").cloned() {
419+
spec_inner["width"] = w;
420+
}
421+
if let Some(h) = vl_spec.get("height").cloned() {
422+
spec_inner["height"] = h;
423+
}
414424

415425
vl_spec["spec"] = spec_inner;
416-
vl_spec.as_object_mut().unwrap().remove("layer");
426+
let obj = vl_spec.as_object_mut().unwrap();
427+
obj.remove("layer");
428+
obj.remove("width");
429+
obj.remove("height");
417430

418-
// Apply scale resolution
419431
apply_facet_scale_resolution(vl_spec, &facet.properties, coord_kind);
420-
421-
// Apply additional properties (not columns for grid)
422432
apply_facet_properties(vl_spec, &facet.properties, false);
423433
}
424434
}
@@ -865,19 +875,21 @@ fn apply_facet_properties(
865875
}
866876
}
867877

878+
/// Vega-Lite schema URL. Update this, the vendored schema file, and
879+
/// the `include_str!` path in `VL_SCHEMA` when bumping versions.
880+
const VEGALITE_SCHEMA: &str = "https://vega.github.io/schema/vega-lite/v6.json";
881+
868882
/// Vega-Lite JSON writer
869883
///
870884
/// Generates Vega-Lite v6 specifications from ggsql specs and data.
871885
pub struct VegaLiteWriter {
872-
/// Vega-Lite schema version
873886
schema: String,
874887
}
875888

876889
impl VegaLiteWriter {
877-
/// Create a new Vega-Lite writer with default settings
878890
pub fn new() -> Self {
879891
Self {
880-
schema: "https://vega.github.io/schema/vega-lite/v6.json".to_string(),
892+
schema: VEGALITE_SCHEMA.to_string(),
881893
}
882894
}
883895

@@ -1078,13 +1090,101 @@ mod tests {
10781090
use crate::plot::{Labels, Layer, ParameterValue};
10791091
use crate::Geom;
10801092
use polars::prelude::*;
1093+
use serde_json::Value;
10811094
use std::collections::HashMap;
1095+
use std::sync::LazyLock;
10821096

10831097
// Re-export test functions from submodules
10841098
use super::data::find_bin_for_value;
10851099
use super::encoding::infer_field_type;
10861100
use super::layer::geom_to_mark;
10871101

1102+
fn sanitize_def_name(s: &str) -> String {
1103+
s.chars()
1104+
.map(|c| match c {
1105+
'<' | '>' | '(' | ')' | '|' | '[' | ']' | '"' | ' ' | '{' | '}' => '_',
1106+
_ => c,
1107+
})
1108+
.collect()
1109+
}
1110+
1111+
fn rewrite_refs(val: &mut Value) {
1112+
match val {
1113+
Value::Object(obj) => {
1114+
if let Some(Value::String(s)) = obj.get_mut("$ref") {
1115+
if let Some(defname) = s.strip_prefix("#/definitions/") {
1116+
let sanitized = sanitize_def_name(defname);
1117+
if sanitized != defname {
1118+
*s = format!("#/definitions/{}", sanitized);
1119+
}
1120+
}
1121+
}
1122+
for v in obj.values_mut() {
1123+
rewrite_refs(v);
1124+
}
1125+
}
1126+
Value::Array(arr) => arr.iter_mut().for_each(rewrite_refs),
1127+
_ => {}
1128+
}
1129+
}
1130+
1131+
/// The VL v6 schema uses TypeScript-style generics in definition names (e.g.,
1132+
/// `MarkPropDef<(Gradient|string|null)>`) which are invalid in `$ref` URIs.
1133+
/// Rename those definitions and rewrite their `$ref` references.
1134+
fn sanitize_vegalite_schema(schema: &mut Value) {
1135+
let defs = match schema
1136+
.get_mut("definitions")
1137+
.and_then(|d| d.as_object_mut())
1138+
{
1139+
Some(d) => d,
1140+
None => return,
1141+
};
1142+
1143+
let substitutions: Vec<(String, String)> = defs
1144+
.keys()
1145+
.filter(|k| sanitize_def_name(k) != **k)
1146+
.map(|k| (k.clone(), sanitize_def_name(k)))
1147+
.collect();
1148+
1149+
if substitutions.is_empty() {
1150+
return;
1151+
}
1152+
1153+
for (orig, sanitized) in &substitutions {
1154+
if let Some(val) = defs.remove(orig.as_str()) {
1155+
defs.insert(sanitized.clone(), val);
1156+
}
1157+
}
1158+
1159+
rewrite_refs(schema);
1160+
}
1161+
1162+
static VL_SCHEMA: LazyLock<jsonschema::Validator> = LazyLock::new(|| {
1163+
let mut schema: Value =
1164+
serde_json::from_str(include_str!("schema/v6.json")).expect("invalid schema JSON");
1165+
sanitize_vegalite_schema(&mut schema);
1166+
jsonschema::options()
1167+
.with_draft(jsonschema::Draft::Draft7)
1168+
.should_validate_formats(false)
1169+
.build(&schema)
1170+
.expect("invalid JSON Schema")
1171+
});
1172+
1173+
fn assert_valid_vegalite(json_str: &str) {
1174+
let spec: Value = serde_json::from_str(json_str).expect("invalid JSON");
1175+
let errors: Vec<String> = VL_SCHEMA
1176+
.iter_errors(&spec)
1177+
.map(|e| format!(" - {} (at {})", e, e.instance_path()))
1178+
.collect();
1179+
if !errors.is_empty() {
1180+
panic!(
1181+
"Vega-Lite schema validation failed ({} errors):\n{}",
1182+
errors.len(),
1183+
errors.join("\n")
1184+
);
1185+
}
1186+
}
1187+
10881188
/// Helper to wrap a DataFrame in a data map for testing (uses layer 0 key)
10891189
fn wrap_data(df: DataFrame) -> HashMap<String, DataFrame> {
10901190
wrap_data_for_layers(df, 1)
@@ -1305,6 +1405,7 @@ mod tests {
13051405
// Generate Vega-Lite JSON
13061406
transform_spec(&mut spec);
13071407
let json_str = writer.write(&spec, &wrap_data(df)).unwrap();
1408+
assert_valid_vegalite(&json_str);
13081409
let vl_spec: Value = serde_json::from_str(&json_str).unwrap();
13091410

13101411
// Verify structure (uses layer array with inline data)
@@ -1350,6 +1451,7 @@ mod tests {
13501451

13511452
transform_spec(&mut spec);
13521453
let json_str = writer.write(&spec, &wrap_data(df)).unwrap();
1454+
assert_valid_vegalite(&json_str);
13531455
let vl_spec: Value = serde_json::from_str(&json_str).unwrap();
13541456

13551457
assert_eq!(vl_spec["title"], "My Chart");
@@ -1385,6 +1487,7 @@ mod tests {
13851487

13861488
transform_spec(&mut spec);
13871489
let json_str = writer.write(&spec, &wrap_data(df)).unwrap();
1490+
assert_valid_vegalite(&json_str);
13881491
let vl_spec: Value = serde_json::from_str(&json_str).unwrap();
13891492

13901493
assert_eq!(vl_spec["layer"][0]["encoding"]["color"]["value"], "red");
@@ -1515,6 +1618,7 @@ mod tests {
15151618

15161619
transform_spec(&mut spec);
15171620
let json_str = writer.write(&spec, &wrap_data_for_layers(df, 2)).unwrap();
1621+
assert_valid_vegalite(&json_str);
15181622
let vl_spec: Value = serde_json::from_str(&json_str).unwrap();
15191623

15201624
// Should have layer array with 2 layers
@@ -1624,6 +1728,7 @@ mod tests {
16241728

16251729
transform_spec(&mut spec);
16261730
let json_str = writer.write(&spec, &wrap_data(df)).unwrap();
1731+
assert_valid_vegalite(&json_str);
16271732
let vl_spec: Value = serde_json::from_str(&json_str).unwrap();
16281733

16291734
// Check that labelExpr contains VL's range-style format
@@ -1719,6 +1824,7 @@ mod tests {
17191824
let result = writer.write(&spec, &wrap_data(simple_df()));
17201825
assert!(result.is_ok());
17211826
let json_str = result.unwrap();
1827+
assert_valid_vegalite(&json_str);
17221828
let json: Value = serde_json::from_str(&json_str).unwrap();
17231829

17241830
// Single-layer spec uses layer array structure
@@ -1747,6 +1853,7 @@ mod tests {
17471853
let result = writer.write(&spec, &wrap_data(simple_df()));
17481854
assert!(result.is_ok());
17491855
let json_str = result.unwrap();
1856+
assert_valid_vegalite(&json_str);
17501857
let json: Value = serde_json::from_str(&json_str).unwrap();
17511858

17521859
let encoding = &json["layer"][0]["encoding"];
@@ -1790,6 +1897,7 @@ mod tests {
17901897
let result = writer.write(&spec, &wrap_data(df));
17911898
assert!(result.is_ok());
17921899
let json_str = result.unwrap();
1900+
assert_valid_vegalite(&json_str);
17931901
let json: Value = serde_json::from_str(&json_str).unwrap();
17941902

17951903
let encoding = &json["layer"][0]["encoding"];
@@ -1829,6 +1937,7 @@ mod tests {
18291937
let result = writer.write(&spec, &wrap_data(df));
18301938
assert!(result.is_ok());
18311939
let json_str = result.unwrap();
1940+
assert_valid_vegalite(&json_str);
18321941
let json: Value = serde_json::from_str(&json_str).unwrap();
18331942

18341943
// Point has linetype => Null, should not appear in encoding
@@ -1853,6 +1962,7 @@ mod tests {
18531962
let result = writer.write(&spec, &wrap_data(simple_df()));
18541963
assert!(result.is_ok());
18551964
let json_str = result.unwrap();
1965+
assert_valid_vegalite(&json_str);
18561966
let json: Value = serde_json::from_str(&json_str).unwrap();
18571967

18581968
let encoding = &json["layer"][0]["encoding"];
@@ -1872,6 +1982,7 @@ mod tests {
18721982
let result = writer.write(&spec, &wrap_data(simple_df()));
18731983
assert!(result.is_ok());
18741984
let json_str = result.unwrap();
1985+
assert_valid_vegalite(&json_str);
18751986
let json: Value = serde_json::from_str(&json_str).unwrap();
18761987

18771988
let encoding = &json["layer"][0]["encoding"];
@@ -2181,6 +2292,7 @@ mod tests {
21812292

21822293
transform_spec(&mut spec);
21832294
let json_str = writer.write(&spec, &wrap_data(df)).unwrap();
2295+
assert_valid_vegalite(&json_str);
21842296
let vl_spec: Value = serde_json::from_str(&json_str).unwrap();
21852297

21862298
// Verify resolve.scale is set to independent for both axes
@@ -2264,6 +2376,7 @@ mod tests {
22642376

22652377
transform_spec(&mut spec);
22662378
let json_str = writer.write(&spec, &wrap_data(df)).unwrap();
2379+
assert_valid_vegalite(&json_str);
22672380
let vl_spec: Value = serde_json::from_str(&json_str).unwrap();
22682381

22692382
// Verify only y scale is independent
@@ -2344,6 +2457,7 @@ mod tests {
23442457

23452458
transform_spec(&mut spec);
23462459
let json_str = writer.write(&spec, &wrap_data(df)).unwrap();
2460+
assert_valid_vegalite(&json_str);
23472461
let vl_spec: Value = serde_json::from_str(&json_str).unwrap();
23482462

23492463
// Verify NO resolve.scale (fixed scales don't need it)
@@ -2364,6 +2478,40 @@ mod tests {
23642478
);
23652479
}
23662480

2481+
#[test]
2482+
fn test_schema_validation_catches_invalid_spec() {
2483+
let writer = VegaLiteWriter::new();
2484+
let mut spec = build_spec(Geom::point());
2485+
let df = simple_df();
2486+
transform_spec(&mut spec);
2487+
let json_str = writer.write(&spec, &wrap_data(df)).unwrap();
2488+
assert_valid_vegalite(&json_str);
2489+
2490+
let invalid = r#"{"$schema": "https://vega.github.io/schema/vega-lite/v6.json", "mark": "not_a_mark"}"#;
2491+
let result = std::panic::catch_unwind(|| assert_valid_vegalite(invalid));
2492+
assert!(result.is_err(), "invalid spec should fail validation");
2493+
}
2494+
2495+
#[test]
2496+
fn test_vendored_schema_matches_upstream() {
2497+
let response = match ureq::get(VEGALITE_SCHEMA).call() {
2498+
Ok(r) => r,
2499+
Err(_) => {
2500+
eprintln!("Skipping vendored schema check: could not reach {VEGALITE_SCHEMA}");
2501+
return;
2502+
}
2503+
};
2504+
let body = response.into_body().read_to_string().unwrap();
2505+
let upstream: Value = serde_json::from_str(&body).unwrap();
2506+
let vendored: Value =
2507+
serde_json::from_str(include_str!("schema/v6.json")).expect("invalid schema JSON");
2508+
assert_eq!(
2509+
upstream, vendored,
2510+
"Vendored schema does not match upstream at {VEGALITE_SCHEMA}. \
2511+
Re-download with: curl -sL '{VEGALITE_SCHEMA}' > src/writer/vegalite/schema/v6.json",
2512+
);
2513+
}
2514+
23672515
#[test]
23682516
fn test_secondary_channels_have_no_disallowed_properties() {
23692517
// Vega-Lite secondary channels (x2, y2, theta2, radius2) only support:

0 commit comments

Comments
 (0)