This repository contains JSON schema, JSON-LD frames, contexts, and SHACL rule sets for validating CDIF metadata documents.
- Files
- Quick Start
- Validation Workflow
- Conformance-URI-Driven Validation (ConformanceValidate.py)
- RO-Crate Conversion and Validation
- Croissant Conversion
- Usage Examples
- Context Requirements
- Authoring Instances Without Prefixes
- Schema Structure
- Flattened Graph Schema
- Troubleshooting
- Composite SHACL Shapes
- SHACL Validation
- DDI-CDI Resolved Schema
- Notes
| File | Description |
|---|---|
| CDIFDiscoverySchema.json | JSON Schema for framed (tree) CDIF discovery profile metadata, generated by generate_validation_schema.py from CDIFDiscoveryProfile resolvedSchema |
| CDIFCompleteSchema.json | JSON Schema for framed (tree) CDIF complete profile metadata (discovery + data description + archive + provenance), generated by generate_validation_schema.py from CDIFcompleteProfile resolvedSchema |
| CDIFDataDescriptionSchema.json | JSON Schema for framed (tree) CDIF data description profile metadata (discovery + data description), generated by generate_validation_schema.py from CDIFDataDescriptionProfile resolvedSchema |
| generate_validation_schema.py | Generates framed-tree validation schemas from building block profile resolved schemas |
| CDIF-graph-schema-2026.json | JSON Schema for flattened JSON-LD graphs (@graph arrays), generated by generate_graph_schema.py |
| generate_graph_schema.py | Generates the graph schema from building block source schemas |
| ShaclValidation/generate_shacl_shapes.py | Generates composite SHACL shapes from building block rules.shacl files |
| ShaclValidation/generate_shacl_report.py | Generates markdown SHACL validation reports with severity grouping |
| ShaclValidation/CDIF-Discovery-Shapes.ttl | Composite SHACL shapes for CDIFDiscovery profile (generated by ShaclValidation/generate_shacl_shapes.py) |
| ShaclValidation/CDIF-Complete-Shapes.ttl | Composite SHACL shapes for CDIFcomplete profile (generated by generate_shacl_shapes.py --profile complete) |
| CDIF-frame-2026.jsonld | JSON-LD frame for 2026 schema |
| CDIF-context-2026.jsonld | JSON-LD context for authoring without namespace prefixes |
| FrameAndValidate.py | Python script for framing and validation |
| croissant/ConvertToCroissant.py | Converts current-cdif: CDIF JSON-LD to Croissant 1.1 (mlcommons.org/croissant/1.1) format |
| croissant/ConvertFromCroissant.py | Converts Croissant JSON-LD to CDIF DataDescription (lossy inverse); see croissant/CroissantToCDIF.md |
| validate_building_blocks.py | Validates building block schemas, SHACL shapes, and examples across the BB source tree |
| validate-cdif.bat | Windows batch script for oXygen XML Editor integration |
| batch_validate.py | Batch validation of CDIF metadata files across multiple file groups (JSON Schema + SHACL) |
| ConformanceValidate.py | Profile-agnostic validator: discovers the profiles a document claims via schema:subjectOf/dcterms:conformsTo and validates against each profile's JSON Schema + SHACL. --source w3id (fetch from the w3id redirector) or --source local (local schemas via conformance-schema-map.json). Accepts a single file or a directory (batch). Engine importable as run_conformance(...) |
| conformance-schema-map.json | Local URI→schema/SHACL map used by ConformanceValidate.py --source local |
| geocodes_harvester.py | Harvests dataset metadata from the EarthCube GeoCodes SPARQL endpoint, extracts original JSON-LD from landing pages, and optionally converts to CDIF core or discovery profile format |
| DCAT/dcat_to_cdif.py | Converts DCAT JSON-LD catalogs to CDIF schema.org format. Maps DCAT/Dublin Core properties to schema.org equivalents per the CDIF DCAT implementation guide. See DCAT/README.md |
| DDI/ddi_to_cdif.py | Converts DDI Codebook 2.5 XML (e.g., from Harvard Dataverse) to CDIF DataDescription JSON-LD: study-level metadata, <var> → schema:variableMeasured, <fileDscr> → schema:DataDownload/cdi:TabularTextDataSet, tab-file headers → physical mappings |
| File | Description |
|---|---|
| ddi-cdi/ddi-cdi.schema_normative.json | Full DDI-CDI normative JSON Schema (395 definitions) |
| ddi-cdi/cls-InstanceVariable-resolved.json | Self-contained resolved schema for DDI-CDI InstanceVariable class |
| ddi-cdi/cls-InstanceVariable-resolved-README.md | Documentation for the resolved schema generation process |
| File | Description |
|---|---|
| CDIFDiscoverySchema.json | Hand-maintained discovery schema (superseded by generated version) |
| CDIFCompleteSchema.json | Hand-maintained complete schema (superseded by generated version) |
| CDIF-JSONLD-schema-2026.json | Original all-in-one framed tree schema (superseded by CDIFDiscoverySchema + CDIFCompleteSchema) |
| CDIF-JSONLD-schema-schemaprefix.json | JSON Schema for CDIF Discovery profile metadata with schema: prefixes |
| CDIF-frame.jsonld | JSON-LD frame for legacy schema |
| CDIF-context.jsonld | Legacy JSON-LD context |
```bash
pip install PyLD jsonschema
```

```bash
# Using Python script (default: 2026 schema)
python FrameAndValidate.py my-metadata.jsonld -v
# Using Windows batch script
validate-cdif.bat my-metadata.jsonld
```

```bash
python FrameAndValidate.py my-metadata.jsonld -o framed.json -v
```

batch_validate.py runs both JSON Schema and SHACL validation across multiple file groups:

```bash
python batch_validate.py
```

File groups validated:
- testJSONMetadata -- 77 ADA metadata test files
- cdifbook -- 10 cdifbook example documents
- cdifProfiles -- 5 CDIF profile examples from building blocks
- adaProfiles -- 36 ADA profile examples from building blocks
Output shows per-file results for each validation type with severity-aware reporting:
- JSON Schema: PASS or FAIL
- SHACL: PASS (clean), PASS (N warnings, M info), FAIL (N violations, M warnings), or SKIP (for generated output files like -croissant.json, -rocrate.json)
Group summaries and an overall summary list all violations and schema failures.
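
The status labels above follow directly from the per-severity counts. The sketch below illustrates that mapping; `shacl_status` and `is_generated_output` are hypothetical helper names chosen for this example and are not taken from batch_validate.py.

```python
# Suffixes of generated output files that are skipped for SHACL (from the list above).
GENERATED_SUFFIXES = ("-croissant.json", "-rocrate.json")


def is_generated_output(filename: str) -> bool:
    """Return True for generated files that SHACL validation skips."""
    return filename.endswith(GENERATED_SUFFIXES)


def shacl_status(violations: int, warnings: int, infos: int, skipped: bool = False) -> str:
    """Hypothetical helper: map SHACL severity counts to a status label."""
    if skipped:
        return "SKIP"
    if violations:
        # Any sh:Violation result fails the file.
        return f"FAIL ({violations} violations, {warnings} warnings)"
    if warnings or infos:
        # Warnings and info results do not fail the file.
        return f"PASS ({warnings} warnings, {infos} info)"
    return "PASS (clean)"
```

The key design point is that only violations change the outcome to FAIL; warnings and info are reported but still count as a pass.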
Each record is validated against exactly the profiles its catalog record declares (per-declared-profile, via ConformanceValidate.py); the 77 testJSONMetadata files declare core + discovery + manifest + provenance. As of June 2026, validation across testJSONMetadata (77 files) and all 5 CDIF profile examples shows:
- JSON Schema: 77/77 testJSONMetadata pass the Discovery JSON Schema (the only framed-tree schema among their declared profiles; core, manifest, and provenance are validated by SHACL only)
- Profile examples: 5/5 pass (Discovery, DiscoveryMinimal, DiscoveryComplete, DataDescription, Complete)
- SHACL Violations: 0 across all files
- SHACL Warnings/Info: All files pass with warnings/info only — these reflect optional-but-recommended properties (missing activity descriptions, contact points, physical data types, etc.)
SHACL severity levels are aligned with JSON Schema: properties that are optional in the JSON Schema are sh:Warning (not sh:Violation) in SHACL.
CDIF metadata is expressed as JSON-LD. To validate JSON-LD documents against the JSON Schema, you need to first frame the document to ensure it has the correct structure. The framing process:
- Reshapes the JSON-LD graph into a tree structure
- Ensures properties use the expected prefixes (e.g., schema:name)
- Embeds referenced nodes inline
- Normalizes arrays and single values
Use a JSON-LD processor to apply CDIF-frame-2026.jsonld to your metadata document.
Validate the framed output against the appropriate schema:
- CDIFDiscoverySchema.json -- discovery profile only
- CDIFDataDescriptionSchema.json -- discovery + data description
- CDIFCompleteSchema.json -- discovery + data description + archive + provenance (default)
ConformanceValidate.py is a profile-agnostic validator that discovers which
schemas to use from the instance document itself, rather than requiring you
to specify a profile up front.
- Reads the instance JSON-LD document.
- Extracts every dcterms:conformsTo URI from schema:subjectOf entries that are tagged schema:additionalType: dcat:CatalogRecord. Entries without the CatalogRecord tag (e.g. a related publication carrying its own conformsTo) are skipped; they don't declare profile conformance for the parent dataset. URIs with an ada: (extension) prefix are ignored.
- For each URI, resolves the JSON Schema and SHACL rules from the selected source (--source):
  - w3id (default): fetches <URI>/schema and <URI>/shacl from the https://w3id.org/cdif/ redirector (authoritative; needs network).
  - local: looks the URI up in a JSON map file (--schema-map, default conformance-schema-map.json beside the script) that points at local framed-tree schemas + SHACL shapes. Works offline. URIs absent from the map report as no_schema / no_shacl rather than failing.
- Frames + compacts the document with the CDIF output context (re-wrapping schema:propertyID, schema:additionalType, etc. into arrays where the schemas expect them).
- Validates against each profile's schema (and optionally SHACL) and prints per-profile pass/fail with attributed error messages.
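
The profile-discovery step can be sketched as follows. This is an illustration of the rule described above, not the code in ConformanceValidate.py; it assumes a compacted document with prefixed keys (schema:subjectOf, dcterms:conformsTo), and `declared_profiles` is a hypothetical function name.

```python
def _as_list(value):
    """Wrap a scalar in a list; pass lists through; treat None as empty."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def declared_profiles(doc: dict) -> list:
    """Collect dcterms:conformsTo URIs from CatalogRecord-tagged subjectOf entries."""
    uris = []
    for entry in _as_list(doc.get("schema:subjectOf")):
        if not isinstance(entry, dict):
            continue
        if "dcat:CatalogRecord" not in _as_list(entry.get("schema:additionalType")):
            continue  # e.g. a related publication carrying its own conformsTo
        for ref in _as_list(entry.get("dcterms:conformsTo")):
            uri = ref.get("@id") if isinstance(ref, dict) else ref
            # ada: extension URIs are ignored; duplicates are dropped.
            if isinstance(uri, str) and not uri.startswith("ada:") and uri not in uris:
                uris.append(uri)
    return uris
```

Each returned URI would then be handed to the selected resolver to locate its JSON Schema and SHACL shapes.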
The input may be a single file (per-profile report) or a directory
(batch mode — per-file PASS/FAIL lines plus an aggregate summary and an
error-pattern histogram). The validation engine is also importable:
run_conformance(doc, resolver, ...) returns a JSON-serializable results
dict (conformsTo, profiles[].schema/shacl.status/errors,
total_violations) so a web application can call it directly and pick the
resolver (W3idResolver or LocalResolver) per request.
conformance-schema-map.json maps conformsTo URIs to local schema/SHACL
files (paths relative to the map file; shacl optional; trailing-slash and
datadescription/data_description insensitive):

```json
{
  "https://w3id.org/cdif/discovery/1.0": {
    "schema": "CDIFDiscoverySchema.json",
    "shacl": "ShaclValidation/CDIF-Discovery-Shapes.ttl"
  }
}
```

The shipped map covers the three profiles that have framed-tree schemas in
this repo (discovery, data_description, complete). Other conformance URIs
(core, manifest, provenance, …) resolve as no_schema in local mode — use
--source w3id for the authoritative, complete set, or extend the map.
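
A lookup that tolerates trailing slashes and the datadescription/data_description spelling difference might look like the sketch below. The normalization rule and the names `normalize_profile_uri` and `lookup_profile` are assumptions made for illustration; the script's actual matching logic may differ.

```python
# Example map in the same shape as conformance-schema-map.json.
SCHEMA_MAP = {
    "https://w3id.org/cdif/discovery/1.0": {
        "schema": "CDIFDiscoverySchema.json",
        "shacl": "ShaclValidation/CDIF-Discovery-Shapes.ttl",
    }
}


def normalize_profile_uri(uri: str) -> str:
    """Hypothetical normalizer: drop trailing slashes and unify the two spellings."""
    return uri.rstrip("/").replace("datadescription", "data_description")


def lookup_profile(uri: str, schema_map: dict) -> dict:
    """Return the map entry for a conformsTo URI, or an empty dict if unmapped."""
    normalized = {normalize_profile_uri(key): entry for key, entry in schema_map.items()}
    return normalized.get(normalize_profile_uri(uri), {})


entry = lookup_profile("https://w3id.org/cdif/discovery/1.0/", SCHEMA_MAP)
schema_status = "found" if "schema" in entry else "no_schema"
```

Returning an empty entry instead of raising mirrors the documented behavior that unmapped URIs report as no_schema / no_shacl rather than failing.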

```bash
# Default (w3id source): both passes, cache fetched artifacts
python ConformanceValidate.py myrecord.jsonld --cache-dir .cache
# Local source (offline): uses conformance-schema-map.json
python ConformanceValidate.py myrecord.jsonld --source local
# Local source with an explicit map
python ConformanceValidate.py myrecord.jsonld --source local --schema-map mymap.json
# Verbose: show every URI it discovers and every resolve
python ConformanceValidate.py myrecord.jsonld --verbose
# JSON Schema only (skip SHACL)
python ConformanceValidate.py myrecord.jsonld --no-shacl
# Batch: validate a whole directory, summary only
python ConformanceValidate.py ./testJSONMetadata --source local --summary
# (w3id only) Accept-header content negotiation on the bare URI instead of
# the /schema and /shacl sub-paths
python ConformanceValidate.py myrecord.jsonld --use-accept
```

Per-profile sections list violations like:

```
======================================================================
Profile: https://w3id.org/cdif/data_description/1.0
======================================================================
JSON Schema: PASSED
SHACL: 2 violation(s)
- cdif:InstanceVariable missing required cdif:name [path=cdif:name, focus=#var1]
- ...
```

A final summary reports total violations across all profiles. Exit code is 0 if no violations, 1 otherwise.

```bash
pip install pyld jsonschema pyshacl requests
```

| Aspect | FrameAndValidate.py | ConformanceValidate.py |
|---|---|---|
| Profile selection | User picks via --schema flag | Discovered from the document's dcterms:conformsTo |
| Validation count | One profile per run | All profiles the document claims |
| Schema source | Local file path | Fetched from <URI>/schema via w3id (default), or resolved from the local map with --source local |
| SHACL | Not built-in | Fetched from <URI>/shacl (or the local map), validated via pyshacl |
| Use case | One profile, deep | Cross-profile sweep / conformance check |

When you know exactly which profile you want, use the per-profile
FrameAndValidate.py in each release repo (they have profile-specific
ARRAY_PROPERTIES lists tuned for that profile's idioms). When you want
to ask "what does this document actually conform to, and how well?", use
ConformanceValidate.py.
RO-Crate conversion and validation tools (ConvertToROCrate.py, ValidateROCrate.py) have been moved to the CDIF packaging repository. These tools convert nested/compacted CDIF JSON-LD into RO-Crate 1.1 form via JSON-LD expand + flatten.
See the packaging repository documentation for conversion details, validation checks, and usage.
Two converters live in croissant/: forward (CDIF → Croissant) and inverse
(Croissant → CDIF DataDescription / Discovery).
The converters target Croissant 1.1 (http://mlcommons.org/croissant/1.1)
and the current cdif: CDIF schema; the inverse accepts Croissant 1.0 or 1.1.

```bash
# Forward: CDIF -> Croissant 1.1
python croissant/ConvertToCroissant.py input.jsonld -o output-croissant.json
python -c "import mlcroissant as mlc; mlc.Dataset(jsonld='output-croissant.json')"  # optional
# Inverse: Croissant -> CDIF DataDescription / Discovery
python croissant/ConvertFromCroissant.py input-croissant.json -o output.jsonld
# then validate against the current Discovery / DataDescription profile schema
```

The inverse is lossy: Croissant carries no equivalents for
prov:wasGeneratedBy, dqv:hasQualityMeasurement, schema:measurementTechnique,
schema:spatialCoverage/temporalCoverage, the CSVW table block, or the Data
Structure component roles. The script preserves anything the forward converter
passed through verbatim, reconstructs schema:identifier from a DOI in
citeAs/url, and maps cr:RecordSet.key → cdif:hasPrimaryKey. If the source
Croissant has no recordSet, the output validates against the Discovery schema
rather than the DataDescription schema (the appropriate profile in that case).
See croissant/README.md for detailed documentation on
both converters, property mappings, and example output. The full
property-by-property mappings are in croissant/CDIFtoCroissant.md
(forward) and croissant/CroissantToCDIF.md
(inverse).
The FrameAndValidate.py script handles the complete workflow:

```bash
# Validate with 2026 schema (default)
python FrameAndValidate.py my-metadata.jsonld -v
# Save framed output
python FrameAndValidate.py my-metadata.jsonld -o framed.json -v
# Use legacy schema
python FrameAndValidate.py my-metadata.jsonld --frame archive/CDIF-frame.jsonld --schema archive/CDIF-JSONLD-schema-schemaprefix.json -v
```

Options:
- -v, --validate - Validate against JSON Schema
- -o, --output FILE - Save framed output to file
- --schema FILE - Path to JSON Schema (default: CDIFCompleteSchema.json)
- --frame FILE - Path to JSON-LD frame (default: CDIF-frame-2026.jsonld)

The validate-cdif.bat script enables validation from within oXygen XML Editor.
- Go to Tools → External Tools → Configure...
- Click New and configure:
| Field | Value |
|---|---|
| Name | CDIF Validate |
| Command | Path to validate-cdif.bat |
| Arguments | "${cf}" |
| Working directory | (leave empty) |
- Open a JSON-LD file in oXygen
- Go to Tools → External Tools → CDIF Validate
- Results appear in the oXygen console

```
validate-cdif.bat file.jsonld            # Validate with 2026 schema
validate-cdif.bat file.jsonld --framed   # Validate + save framed output
validate-cdif.bat file.jsonld --legacy   # Use pre-2026 schema
validate-cdif.bat --help                 # Show help
```

```python
import json
from pyld import jsonld
import jsonschema

# Load the frame
with open('CDIF-frame-2026.jsonld') as f:
    frame = json.load(f)

# Load your JSON-LD metadata document
with open('my-metadata.jsonld') as f:
    doc = json.load(f)

# Load the schema
with open('CDIFCompleteSchema.json') as f:
    schema = json.load(f)

# Step 1: Frame the document
framed = jsonld.frame(doc, frame)

# Step 2: Validate against schema
try:
    jsonschema.validate(instance=framed, schema=schema)
    print("Validation successful!")
except jsonschema.ValidationError as e:
    print(f"Validation failed: {e.message}")
```

Required packages:

```bash
pip install PyLD jsonschema
```

```javascript
const jsonld = require('jsonld');
const Ajv = require('ajv');
const addFormats = require('ajv-formats');
const fs = require('fs');

async function validateCDIF(metadataPath) {
  // Load files
  const frame = JSON.parse(fs.readFileSync('CDIF-frame-2026.jsonld', 'utf8'));
  const doc = JSON.parse(fs.readFileSync(metadataPath, 'utf8'));
  const schema = JSON.parse(fs.readFileSync('CDIFCompleteSchema.json', 'utf8'));

  // Step 1: Frame the document
  const framed = await jsonld.frame(doc, frame);

  // Step 2: Validate against schema
  const ajv = new Ajv({ allErrors: true });
  addFormats(ajv);
  const validate = ajv.compile(schema);
  if (validate(framed)) {
    console.log('Validation successful!');
    return true;
  } else {
    console.log('Validation failed:', validate.errors);
    return false;
  }
}

validateCDIF('my-metadata.jsonld');
```

Required packages:

```bash
npm install jsonld ajv ajv-formats
```

Your JSON-LD metadata documents must include a @context with namespace prefixes. Only schema and dcterms are required at the discovery level; additional prefixes are needed depending on which optional properties are used.
Required (discovery level):

```json
{
  "@context": {
    "schema": "http://schema.org/",
    "dcterms": "http://purl.org/dc/terms/"
  }
}
```

Optional prefixes (add as needed for the properties you use):

| Prefix | IRI | When needed |
|---|---|---|
| spdx | http://spdx.org/rdf/terms# | Checksum properties on distributions |
| dcat | http://www.w3.org/ns/dcat# | dcat:CatalogRecord on subjectOf |
| geosparql | http://www.opengis.net/ont/geosparql# | Spatial coverage geometry |
| prov | http://www.w3.org/ns/prov# | Provenance (wasGeneratedBy) |
| dqv | http://www.w3.org/ns/dqv# | Data quality measurements |
| cdi | http://ddialliance.org/Specification/DDI-CDI/1.0/RDF/ | DDI-CDI variable/data structure properties |
| csvw | http://www.w3.org/ns/csvw# | CSVW tabular data properties (data description level) |

Domain-specific metadata may also use extension namespace prefixes. For example, the XAS (X-ray absorption spectroscopy) test example uses:

| Prefix | IRI | Purpose |
|---|---|---|
| xas | http://cdi4exas.org/ | XAS-specific types and properties (beamline, detector, edge energy, etc.) |
| cdifq | http://crossdomaininteroperability.org/cdifq/ | Placeholder namespace for data structure properties (nColumns, nRows) not yet assigned to a formal vocabulary |

The cdifq namespace is a temporary placeholder. Properties using it (such as row/column counts on data structures) may migrate to DDI-CDI, CSVW, or another standard vocabulary in the future. croissant/ConvertToCroissant.py includes cdifq in its output context so that these terms resolve correctly during JSON-LD processing.

```json
{
  "@context": {
    "schema": "http://schema.org/",
    "dcterms": "http://purl.org/dc/terms/",
    "prov": "http://www.w3.org/ns/prov#",
    "dqv": "http://www.w3.org/ns/dqv#",
    "geosparql": "http://www.opengis.net/ont/geosparql#",
    "spdx": "http://spdx.org/rdf/terms#",
    "time": "http://www.w3.org/2006/time#"
  }
}
```

If you prefer to author metadata without namespace prefixes (e.g., name instead of schema:name), you can use the CDIF-context-2026.jsonld context file. This context maps unprefixed property names to their full IRIs.

```json
{
  "@context": "https://your-server.org/CDIF-context-2026.jsonld",
  "@type": "Dataset",
  "@id": "https://example.org/dataset/123",
  "name": "My Dataset",
  "description": "A sample dataset description",
  "identifier": "dataset-123",
  "dateModified": "2024-01-15",
  "url": "https://example.org/data/123",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "subjectOf": {
    "@type": ["Dataset"],
    "additionalType": ["dcat:CatalogRecord"],
    "sdDatePublished": "2024-01-15"
  }
}
```

The validation workflow handles both prefixed and unprefixed instances:
- Unprefixed instance references CDIF-context-2026.jsonld
- Framing with CDIF-frame-2026.jsonld transforms the instance
- The frame's context uses prefixed names, so the output has prefixed keys
- Validate against CDIFCompleteSchema.json

This means you only need one schema. The framing step normalizes all instances to the prefixed format regardless of how they were authored.
For production use, host CDIF-context-2026.jsonld at a stable URL and reference it in your instances:

```json
{
  "@context": "https://your-server.org/CDIF-context-2026.jsonld",
  ...
}
```

Or embed the context directly in your instance by copying the contents of CDIF-context-2026.jsonld.
The schema validates CDIF Discovery profile metadata with the following required fields:
- @id - Resource identifier
- @type - Must include schema:Dataset
- @context - JSON-LD context with required prefixes
- schema:name - Resource name
- schema:identifier - Primary identifier
- schema:dateModified - Last modification date
- schema:subjectOf - Metadata about the metadata record (requires @type containing schema:Dataset and schema:additionalType containing dcat:CatalogRecord)
- Either schema:url or schema:distribution - Access information
- Either schema:license or schema:conditionsOfAccess - Usage terms

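
As a quick pre-flight check before running the full JSON Schema, the required-field list above can be approximated in a few lines. This is only a sketch: `missing_discovery_fields` is a hypothetical helper, it assumes a framed document with prefixed keys, and it does not inspect the nested requirements inside schema:subjectOf.

```python
REQUIRED_KEYS = (
    "@id", "@type", "@context",
    "schema:name", "schema:identifier", "schema:dateModified", "schema:subjectOf",
)


def _as_list(value):
    """Wrap a scalar in a list; pass lists through; treat None as empty."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def missing_discovery_fields(doc: dict) -> list:
    """Return human-readable descriptions of unmet top-level requirements."""
    problems = [key for key in REQUIRED_KEYS if key not in doc]
    if "@type" in doc and "schema:Dataset" not in _as_list(doc["@type"]):
        problems.append("@type must include schema:Dataset")
    # The two either/or requirements: access information and usage terms.
    if "schema:url" not in doc and "schema:distribution" not in doc:
        problems.append("schema:url or schema:distribution")
    if "schema:license" not in doc and "schema:conditionsOfAccess" not in doc:
        problems.append("schema:license or schema:conditionsOfAccess")
    return problems
```

An empty list means the top-level requirements are met; the JSON Schema remains the authoritative check.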
The 2026 schema adds support for:
Variables (schema:variableMeasured):
- Items are anyOf: PropertyValue-based (cdifVariableMeasured) or schema:StatisticalVariable
- PropertyValue variables: typed as schema:PropertyValue with DDI-CDI extensions (cdi:intendedDataType, cdif:simpleUnitOfMeasure, cdi:describedUnitOfMeasure, cdif:uses, cdif:role)
- cdif:role -- enum: UnitIdentifier, Measure, Attribute, Dimension, Descriptor, ReferenceVariable
- StatisticalVariable: typed as schema:StatisticalVariable with schema:statType, schema:measuredProperty (required)
- cdif:physicalDataType is required at the data description level (CDIFDataDescription/CDIFcomplete profiles), not at discovery level

Distributions:
- cdi:StructuredDataSet - For structured formats (JSON, XML, HDF5, NetCDF)
- cdi:TabularTextDataSet - For tabular text (wide format) with CSVW properties:
  - csvw:delimiter, csvw:header, csvw:headerRowCount
  - cdi:isDelimited OR cdi:isFixedWidth
  - cdif:hasPhysicalMapping - Links variables to physical representation
- cdi:LongStructureDataSet - For long/narrow data format where each row is a single observation:
  - A descriptor column identifies which variable each row measures (cdif:role: Descriptor)
  - A reference column holds the actual value (cdif:role: ReferenceVariable)
  - Optional CSVW properties (delimiter, header, etc.) and DDI-CDI physical properties
  - cdif:hasPhysicalMapping - Links variables to physical representation
  - The detailed LongDataStructure component cardinality (exactly one Identifier/VariableDescriptor/VariableValue component) is defined in the cdifDataStructure profile (data_structure/1.1), enforced in both JSON Schema (minContains/maxContains) and SHACL

CDIF-graph-schema-2026.json is the graph-based counterpart to the framed tree schema. It validates flattened JSON-LD documents that use @graph arrays directly, without requiring framing first. This is useful for validating JSON-LD as it naturally comes out of RDF stores or JSON-LD flatten operations.
The schema is generated by generate_graph_schema.py from the CDIF building block source schemas.
The generator reads building block schemas from the metadataBuildingBlocks/_sources/ directory (the BuildingBlockSubmodule). The location is auto-detected or can be overridden:

```bash
# Auto-detect (looks for BuildingBlockSubmodule/_sources/ relative to script)
python generate_graph_schema.py
# Explicit path
python generate_graph_schema.py --bb-dir /path/to/_sources
# Environment variable
export CDIF_BB_DIR=/path/to/_sources
python generate_graph_schema.py
# Custom output path
python generate_graph_schema.py --output my-graph-schema.json
```

```bash
# Validate a flattened JSON-LD document directly
python -c "
import json, jsonschema
with open('CDIF-graph-schema-2026.json') as f: schema = json.load(f)
with open('my-flattened.jsonld') as f: doc = json.load(f)
jsonschema.validate(doc, schema)
print('Valid')
"
```

The graph schema accepts three input forms:
- A {"@context": {...}, "@graph": [...]} document (the primary use case)
- A bare array of typed objects
- A single typed object
The generated schema has this high-level structure:
- root-graph: validates @context prefix declarations + @graph array of nodes
- root-object: a nested if/then/else chain dispatching objects by @type to the correct type definition
- id-reference: shared {"@id": "string"} definition for cross-node references
- 24 type definitions: type-Dataset, type-Person, type-Organization, type-PropertyValue, type-DefinedTerm, type-CreativeWork, type-DataDownload, type-MediaObject, type-WebAPI, type-Action, type-HowTo, type-Place, type-ProperInterval, type-MonetaryGrant, type-Role, type-Activity, type-QualityMeasurement, type-Claim, type-CatalogRecord, type-Identifier, type-InstanceVariable, type-StructuredDataSet, type-TabularTextDataSet, type-LongStructureDataSet

Type dispatch is ordered most-specific-first (e.g., cdi:StructuredDataSet before schema:Dataset) so that subtypes are matched before their parent types.
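
The effect of most-specific-first ordering can be illustrated in plain Python. In the generated schema the dispatch is a nested if/then/else chain, so the function below is only an analogy; `DISPATCH_ORDER` is an abbreviated, assumed ordering and `dispatch` is a hypothetical name.

```python
# Abbreviated, assumed ordering for illustration: subtypes precede parent types.
DISPATCH_ORDER = [
    ("cdi:StructuredDataSet", "type-StructuredDataSet"),
    ("schema:Dataset", "type-Dataset"),
    ("schema:Person", "type-Person"),
]


def dispatch(node: dict):
    """Return the first matching type definition name, or None if nothing matches."""
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]
    for type_iri, definition in DISPATCH_ORDER:
        if type_iri in types:
            return definition
    return None
```

If schema:Dataset were tested first, a node typed as both schema:Dataset and cdi:StructuredDataSet would be validated against the less specific definition, which is what the ordering avoids.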
The generator applies these transformations when reading building block source schemas:
- External $ref resolution -- Cross-building-block $refs (e.g., ../person/schema.yaml) are resolved to internal #/$defs/type-X references
- anyOf alternatives -- Properties that reference other building block types get anyOf [type-ref, id-reference] so they accept either inline objects or @id cross-references
- @type disambiguation -- Composite types get additional type markers for dispatch (e.g., cdifCatalogRecord becomes dcat:CatalogRecord, identifier adds cdi:Identifier)
- @context stripping -- Context declarations are removed from non-root types (the @context goes on the root-graph wrapper only)
- Composite type assembly -- Complex types like type-Dataset merge mandatory + optional building blocks; type-StructuredDataSet / type-TabularTextDataSet / type-LongStructureDataSet compose dataDownload + CDI extensions
- Extended provenance -- type-Activity built from cdifProv building block, requiring multi-typed @type: ["schema:Action", "prov:Activity"], merging base generatedBy properties (prov:used) with schema.org Action properties (schema:agent, schema:actionProcess, etc.). Instruments are nested within prov:used items via schema:instrument sub-key (instruments are prov:Entity subclasses). type-HowTo and type-Claim added as new dispatch types for methodology and assertion objects

- Missing required property
  - Ensure all required fields are present
  - Check that schema:subjectOf contains required nested fields
- Type mismatch
  - Properties like schema:spatialCoverage and schema:temporalCoverage expect arrays
  - Check that @type values use the schema: prefix
- Invalid @type
  - Root @type must include schema:Dataset
  - For 2026 schema, variables must include both schema:PropertyValue and cdi:InstanceVariable
- Framing issues
  - Ensure your document has proper @id values for node references
  - Check that the @context is compatible with the frame
- dcterms:conformsTo syntax
  - Must use object syntax: [{"@id": "..."}] not ["..."]

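
For the dcterms:conformsTo case, documents that use bare strings can be repaired mechanically before validation. The sketch below shows one way to do that; `conforms_to_object_syntax` is a hypothetical helper and not part of the repository's scripts.

```python
def conforms_to_object_syntax(values):
    """Rewrite bare-string conformsTo entries as {"@id": ...} objects."""
    if not isinstance(values, list):
        values = [values]
    # Entries that are already objects are passed through unchanged.
    return [value if isinstance(value, dict) else {"@id": value} for value in values]


record = {"dcterms:conformsTo": ["https://w3id.org/cdif/discovery/1.0"]}
record["dcterms:conformsTo"] = conforms_to_object_syntax(record["dcterms:conformsTo"])
```

After the call, `record["dcterms:conformsTo"]` holds a list of `{"@id": ...}` objects, which is the form the schema expects.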
To see the framed output before validation:

```bash
python FrameAndValidate.py my-metadata.jsonld -o framed.json
```

Or in Python:

```python
framed = jsonld.frame(doc, frame)
print(json.dumps(framed, indent=2))
```

In addition to JSON Schema validation, CDIF metadata can be validated using SHACL (Shapes Constraint Language) rules. SHACL validation operates on the RDF graph and can express constraints that JSON Schema cannot -- SPARQL-based targeting, cross-node relationships, and semantic inference.
The composite SHACL shapes are compiled from modular rules.shacl files in individual building blocks in the metadataBuildingBlocks repository. Two profiles are available:
- discovery: ShaclValidation/CDIF-Discovery-Shapes.ttl (64 shapes)
- complete: ShaclValidation/CDIF-Complete-Shapes.ttl (76 shapes, adds provenance + data description)

Quick start:

```bash
# Validate against discovery shapes
python ShaclValidation/ShaclJSONLDContext.py my-metadata.jsonld ShaclValidation/CDIF-Discovery-Shapes.ttl
# Generate a markdown validation report
python ShaclValidation/generate_shacl_report.py my-metadata.jsonld ShaclValidation/CDIF-Complete-Shapes.ttl -o report.md
# Regenerate shapes after building block changes
python ShaclValidation/generate_shacl_shapes.py --profile discovery
python ShaclValidation/generate_shacl_shapes.py --profile complete
```

See ShaclValidation/README.md for detailed documentation on the SHACL tools, shapes architecture, report format, and how to add new building block shapes.
Recommendation: Use both JSON Schema and SHACL validation for comprehensive coverage. batch_validate.py runs both automatically across multiple file groups.
geocodes_harvester.py harvests dataset metadata from the EarthCube GeoCodes catalog (~170K indexed datasets). It queries the Blazegraph SPARQL endpoint, fetches original JSON-LD from source landing pages when available, and optionally converts records to CDIF profile format.

```bash
# List publishers and dataset counts
python geocodes_harvester.py --list-publishers
# Harvest 5 records from diverse publishers, convert to CDIF Discovery
python geocodes_harvester.py --count 5 --output ./examples --cdif discovery
# Harvest from a specific publisher
python geocodes_harvester.py --publisher "PANGAEA" --count 3 --output ./examples
# Harvest without CDIF conversion (raw schema.org JSON-LD)
python geocodes_harvester.py --count 5 --output ./raw-examples
```

The CDIF conversion handles: property prefixing (schema:), @context/@type normalization, @list wrapping for creators, distribution fixes, subjectOf with conformsTo, type mappings (FundingAgency to Organization, Grant to MonetaryGrant, Croissant sc:Dataset to Dataset), Person name synthesis, and sameAs array normalization. All conversions are documented in each record's subjectOf description. Extra properties from the source are preserved (open-world assumption).
DCAT/dcat_to_cdif.py converts DCAT JSON-LD catalogs or individual dataset records to CDIF-conformant schema.org JSON-LD. Maps DCAT/Dublin Core properties to schema.org equivalents per the CDIF DCAT implementation guide.

```bash
# List datasets in a DCAT catalog
python DCAT/dcat_to_cdif.py catalog.jsonld --list
# Convert selected records, validate output
python DCAT/dcat_to_cdif.py catalog.jsonld --output ./examples \
  --select 0,3,5 --catalog-name "My Catalog" --catalog-url "https://example.org/" \
  --validate
```

Key mappings: dcterms:title → schema:name, dcterms:description → schema:description, dcterms:modified → schema:dateModified, dcterms:license → schema:license, dcterms:accessRights → schema:conditionsOfAccess, dcat:keyword → schema:keywords, dcat:Distribution → schema:DataDownload, dcterms:spatial → schema:spatialCoverage, dcterms:temporal → schema:temporalCoverage. Unmapped properties are preserved (open world). The converter auto-detects Discovery vs Core profile based on spatial/temporal content.
See DCAT/README.md for the full property mapping table, PSDI catalog example, and known limitations.
DDI/ddi_to_cdif.py converts a DDI Codebook 2.5 XML file (for example, a DDI export from Harvard Dataverse) to a CDIF DataDescription JSON-LD document.

```bash
# Convert a DDI XML export (DOI is required)
python DDI/ddi_to_cdif.py input.xml --doi https://doi.org/10.7910/DVN/XXXXXX -o output.json
# Also fetch tab-file headers and file size/checksum from the Dataverse API
python DDI/ddi_to_cdif.py input.xml --doi https://doi.org/10.7910/DVN/XXXXXX \
  --fetch-headers --fetch-file-meta -o output.json
```

Key mappings: study titl/abstract → schema:name/schema:description, authors → schema:creator, keyword → schema:keywords, spatial/temporal coverage; DDI <var> → schema:variableMeasured (cdi:InstanceVariable); DDI <fileDscr> → schema:DataDownload (cdi:TabularTextDataSet) with CSVW properties; tab-file headers → physical mappings; caseQnty/varQnty → cdifq:nRows/nColumns. --fetch-headers and --fetch-file-meta pull headers and file size/checksum from the Dataverse API.
Note: The converter currently emits the pre-migration cdi:-prefixed data-structure properties (cdi:role, cdi:physicalDataType, cdi:hasPhysicalMapping, cdi:index, cdi:formats_InstanceVariable). These should be migrated to the current cdif: prefix before the output is treated as current-schema CDIF.
The MetadataExamples/ directory contains sample CDIF JSON-LD documents for testing:
| File | Technique | Description |
|---|---|---|
| tof-htk9-f770.json | ToF-SIMS | Time-of-flight mass spectrometry particle analysis |
| xrd-2j0t-gq80.json | XRD | X-ray diffraction |
| xanes-2arx-b516.json | XANES | X-ray absorption near-edge structure |
| yv1f-jb20.json | -- | General dataset |
| test_se_na2so4-testschemaorg-cdiv3.json | XAS | X-ray absorption spectroscopy with DDI-CDI data structure (WideDataStructure, InstanceVariable, ValueMapping). Uses xas: and cdifq: extension namespaces |
| nwis-water-quality-longdata.json | Water Quality | NWIS groundwater nutrient analysis (464 rows, 20 columns) in cdi:LongStructureDataSet long (narrow) format with Descriptor/ReferenceVariable roles, cdif:hasPhysicalMapping, and 5 Measure domain variables. Declares core/discovery/data_description/data_structure 1.1 conformance. Validates against graph schema (CDIF-graph-schema-2026.json) |
| prov-ocean-temp-example.json | Ocean Temperature | Extended provenance example demonstrating cdifProv building block: action chaining (schema:object/schema:result), multi-typed ["schema:Action", "prov:Activity"] activities, agents with Role wrappers, inline schema:HowTo methodology via schema:actionProcess with 3 steps, diverse instruments, facility location, and backward-compatible prov:used. Validates against graph schema |

Corresponding Croissant output files are in the croissant/ directory.
The ddi-cdi/cls-InstanceVariable-resolved.json file is a standalone JSON Schema (Draft 2020-12) for the DDI-CDI InstanceVariable class, derived from ddi-cdi/ddi-cdi.schema_normative.json. It resolves all $ref references into a self-contained schema suitable for use in editors like oXygen without needing the full 395-definition DDI-CDI schema.
The resolved schema applies several transformations to make the schema practical:
- Reverse properties removed - 767 _OF_ reverse relationship properties stripped (use JSON-LD @reverse instead)
- catalogDetails removed - Catalog-level metadata omitted from all classes
- Redundant classes omitted - cls-DataPoint, cls-Datum, cls-RepresentedVariable simplified to IRI-only references
- XSD types inlined - Primitive types (xsd:string, xsd:integer, etc.) replaced with inline definitions
- Patterns normalized - if/then/else array patterns converted to consistent anyOf
- Frequency-based $ref resolution - Common definitions (>3 uses) in $defs; rare definitions inlined

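
The frequency-based $ref rule can be sketched as a two-pass process: count how often each $ref target is used, then split targets at the threshold. This is an illustration under the stated ">3 uses" rule, not the generator's actual code; `count_refs` and `split_by_frequency` are hypothetical names.

```python
from collections import Counter


def count_refs(schema, counts=None):
    """Recursively count how many times each $ref target appears in a schema."""
    counts = Counter() if counts is None else counts
    if isinstance(schema, dict):
        for key, value in schema.items():
            if key == "$ref" and isinstance(value, str):
                counts[value] += 1
            else:
                count_refs(value, counts)
    elif isinstance(schema, list):
        for item in schema:
            count_refs(item, counts)
    return counts


def split_by_frequency(schema, threshold=3):
    """Return (targets kept in $defs, targets to inline) using the >threshold rule."""
    counts = count_refs(schema)
    shared = sorted(ref for ref, uses in counts.items() if uses > threshold)
    inlined = sorted(ref for ref, uses in counts.items() if uses <= threshold)
    return shared, inlined
```

Keeping frequently used definitions in $defs limits duplication, while inlining rare ones keeps the resolved schema readable in an editor.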
See ddi-cdi/cls-InstanceVariable-resolved-README.md for full details on the generation process, circular reference analysis, and transformation rationale.
- The framed tree schemas (CDIFCompleteSchema.json, CDIFDiscoverySchema.json) are generated from building block profile resolved schemas using generate_validation_schema.py. The hand-maintained originals and the all-in-one CDIF-JSONLD-schema-2026.json are in archive/.
- Legacy schema (CDIF-JSONLD-schema-schemaprefix.json) is still available for older documents.
- All schema.org elements require the schema: prefix for SHACL validation compatibility.
- The frame ensures that after framing, the output structure matches what the JSON schema expects.
- For SHACL validation, use the corresponding .shacl or .ttl files in this repository.
- @type flexibility: All @type definitions in the framed schemas use anyOf to accept either a string ("schema:Dataset") or an array (["schema:Dataset"]). JSON-LD framing may compact single-element arrays to strings; FrameAndValidate.py recursively normalizes all @type values back to arrays.
- spdx:Checksum typing: All spdx:checksum objects must include "@type": "spdx:Checksum". This is required by both the JSON Schema (required: ["@type"]) and SHACL shapes (sh:class spdx:Checksum).
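
The recursive @type normalization mentioned in the notes above can be sketched as follows. This is a minimal illustration of the idea and not the implementation in FrameAndValidate.py; `normalize_types` is a hypothetical name.

```python
def normalize_types(node):
    """Return a copy of a framed document with every @type value as a list."""
    if isinstance(node, list):
        return [normalize_types(item) for item in node]
    if isinstance(node, dict):
        result = {}
        for key, value in node.items():
            if key == "@type" and isinstance(value, str):
                # Framing may have compacted ["schema:Dataset"] to "schema:Dataset".
                result[key] = [value]
            else:
                result[key] = normalize_types(value)
        return result
    return node  # strings, numbers, booleans, None


framed = {
    "@type": "schema:Dataset",
    "schema:creator": [{"@type": "schema:Person"}],
    "schema:distribution": {"@type": ["schema:DataDownload"]},
}
normalized = normalize_types(framed)
```

The function returns a new structure rather than mutating its input, so the original framed output stays available for debugging.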