Files for validation of CDIF metadata

This repository contains JSON schema, JSON-LD frames, contexts, and SHACL rule sets for validating CDIF metadata documents.

Files
Quick Start
Validation Workflow
- Step 1: Frame the JSON-LD Document
- Step 2: Validate Against Schema
Conformance-URI-Driven Validation (ConformanceValidate.py)
RO-Crate Conversion and Validation
Croissant Conversion
- How the Croissant Conversion Works
- Croissant Usage
Usage Examples
Context Requirements
Authoring Instances Without Prefixes
Schema Structure
Flattened Graph Schema
Troubleshooting
- Common Validation Errors
- Debugging
Composite SHACL Shapes
SHACL Validation
DDI-CDI Resolved Schema
Notes

Files

Current (2026 Schema with DDI-CDI/CSVW)

File	Description
`CDIFDiscoverySchema.json`	JSON Schema for framed (tree) CDIF discovery profile metadata, generated by `generate_validation_schema.py` from CDIFDiscoveryProfile resolvedSchema
`CDIFCompleteSchema.json`	JSON Schema for framed (tree) CDIF complete profile metadata (discovery + data description + archive + provenance), generated by `generate_validation_schema.py` from CDIFcompleteProfile resolvedSchema
`CDIFDataDescriptionSchema.json`	JSON Schema for framed (tree) CDIF data description profile metadata (discovery + data description), generated by `generate_validation_schema.py` from CDIFDataDescriptionProfile resolvedSchema
`generate_validation_schema.py`	Generates framed-tree validation schemas from building block profile resolved schemas
`CDIF-graph-schema-2026.json`	JSON Schema for flattened JSON-LD graphs (`@graph` arrays), generated by `generate_graph_schema.py`
`generate_graph_schema.py`	Generates the graph schema from building block source schemas
`ShaclValidation/generate_shacl_shapes.py`	Generates composite SHACL shapes from building block rules.shacl files
`ShaclValidation/generate_shacl_report.py`	Generates markdown SHACL validation reports with severity grouping
`ShaclValidation/CDIF-Discovery-Shapes.ttl`	Composite SHACL shapes for CDIFDiscovery profile (generated by `ShaclValidation/generate_shacl_shapes.py`)
`ShaclValidation/CDIF-Complete-Shapes.ttl`	Composite SHACL shapes for CDIFcomplete profile (generated by `generate_shacl_shapes.py --profile complete`)
`CDIF-frame-2026.jsonld`	JSON-LD frame for 2026 schema
`CDIF-context-2026.jsonld`	JSON-LD context for authoring without namespace prefixes
`FrameAndValidate.py`	Python script for framing and validation
`croissant/ConvertToCroissant.py`	Converts current-`cdif:` CDIF JSON-LD to Croissant 1.1 (mlcommons.org/croissant/1.1) format
`croissant/ConvertFromCroissant.py`	Converts Croissant JSON-LD to CDIF DataDescription (lossy inverse) — see `croissant/CroissantToCDIF.md`
`validate_building_blocks.py`	Validates building block schemas, SHACL shapes, and examples across the BB source tree
`validate-cdif.bat`	Windows batch script for oXygen XML Editor integration
`batch_validate.py`	Batch validation of CDIF metadata files across multiple file groups (JSON Schema + SHACL)
`ConformanceValidate.py`	Profile-agnostic validator: discovers the profiles a document claims via `schema:subjectOf/dcterms:conformsTo` and validates against each profile's JSON Schema + SHACL. `--source w3id` (fetch from the w3id redirector) or `--source local` (local schemas via `conformance-schema-map.json`). Accepts a single file or a directory (batch). Engine importable as `run_conformance(...)`
`conformance-schema-map.json`	Local URI→schema/SHACL map used by `ConformanceValidate.py --source local`
`geocodes_harvester.py`	Harvests dataset metadata from the EarthCube GeoCodes SPARQL endpoint, extracts original JSON-LD from landing pages, and optionally converts to CDIF core or discovery profile format
`DCAT/dcat_to_cdif.py`	Converts DCAT JSON-LD catalogs to CDIF schema.org format. Maps DCAT/Dublin Core properties to schema.org equivalents per the CDIF DCAT implementation guide. See DCAT/README.md
`DDI/ddi_to_cdif.py`	Converts DDI Codebook 2.5 XML (e.g., from Harvard Dataverse) to CDIF DataDescription JSON-LD: study-level metadata, `<var>` → `schema:variableMeasured`, `<fileDscr>` → `schema:DataDownload`/`cdi:TabularTextDataSet`, tab-file headers → physical mappings

DDI-CDI Resolved Schema

File	Description
`ddi-cdi/ddi-cdi.schema_normative.json`	Full DDI-CDI normative JSON Schema (395 definitions)
`ddi-cdi/cls-InstanceVariable-resolved.json`	Self-contained resolved schema for DDI-CDI InstanceVariable class
`ddi-cdi/cls-InstanceVariable-resolved-README.md`	Documentation for the resolved schema generation process

Legacy (Pre-2026, in `archive/`)

File	Description
`CDIFDiscoverySchema.json`	Hand-maintained discovery schema (superseded by generated version)
`CDIFCompleteSchema.json`	Hand-maintained complete schema (superseded by generated version)
`CDIF-JSONLD-schema-2026.json`	Original all-in-one framed tree schema (superseded by CDIFDiscoverySchema + CDIFCompleteSchema)
`CDIF-JSONLD-schema-schemaprefix.json`	JSON Schema for CDIF Discovery profile metadata with `schema:` prefixes
`CDIF-frame.jsonld`	JSON-LD frame for legacy schema
`CDIF-context.jsonld`	Legacy JSON-LD context

Quick Start

Prerequisites

pip install PyLD jsonschema

Validate a Document

# Using Python script (default: 2026 schema)
python FrameAndValidate.py my-metadata.jsonld -v

# Using Windows batch script
validate-cdif.bat my-metadata.jsonld

Save Framed Output for Debugging

python FrameAndValidate.py my-metadata.jsonld -o framed.json -v

Batch Validate Multiple Files

batch_validate.py runs both JSON Schema and SHACL validation across multiple file groups:

python batch_validate.py

File groups validated:

testJSONMetadata -- 77 ADA metadata test files
cdifbook -- 10 cdifbook example documents
cdifProfiles -- 5 CDIF profile examples from building blocks
adaProfiles -- 36 ADA profile examples from building blocks

Output shows per-file results for each validation type with severity-aware reporting:

JSON Schema: PASS or FAIL
SHACL: PASS (clean), PASS (N warnings, M info), FAIL (N violations, M warnings), or SKIP (for generated output files like -croissant.json, -rocrate.json)

Group summaries and an overall summary list all violations and schema failures.

Current Validation Status

As of June 2026, validation across testJSONMetadata (77 files) and all 5 CDIF profile examples shows the results below. Each record is validated against exactly the profiles its catalog record declares (per-declared-profile, via ConformanceValidate.py); the 77 testJSONMetadata files declare core + discovery + manifest + provenance.

JSON Schema: 77/77 testJSONMetadata pass the Discovery JSON Schema (the only framed-tree schema among their declared profiles; core, manifest, and provenance are validated by SHACL only)
Profile examples: 5/5 pass (Discovery, DiscoveryMinimal, DiscoveryComplete, DataDescription, Complete)
SHACL Violations: 0 across all files
SHACL Warnings/Info: All files pass with warnings/info only — these reflect optional-but-recommended properties (missing activity descriptions, contact points, physical data types, etc.)

SHACL severity levels are aligned with JSON Schema: properties that are optional in the JSON Schema are sh:Warning (not sh:Violation) in SHACL.

Validation Workflow

CDIF metadata is expressed as JSON-LD. To validate JSON-LD documents against the JSON Schema, you need to first frame the document to ensure it has the correct structure. The framing process:

Reshapes the JSON-LD graph into a tree structure
Ensures properties use the expected prefixes (e.g., schema:name)
Embeds referenced nodes inline
Normalizes arrays and single values
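
The last point, normalizing arrays and single values, can be sketched in plain Python. This is a minimal illustration rather than the code in FrameAndValidate.py; the function name `wrap_array_properties` and the sample property set are assumptions chosen for the example.

```python
def wrap_array_properties(node, array_props):
    """Recursively wrap single values of the listed properties in a list."""
    if isinstance(node, list):
        return [wrap_array_properties(item, array_props) for item in node]
    if not isinstance(node, dict):
        return node
    result = {}
    for key, value in node.items():
        value = wrap_array_properties(value, array_props)
        if key in array_props and not isinstance(value, list):
            value = [value]
        result[key] = value
    return result


# Hypothetical property set; the real scripts keep their own lists.
ARRAY_PROPS = {"schema:additionalType", "schema:spatialCoverage"}

framed = {
    "@type": "schema:Dataset",
    "schema:additionalType": "dcat:CatalogRecord",
    "schema:spatialCoverage": [{"@type": "schema:Place"}],
}
normalized = wrap_array_properties(framed, ARRAY_PROPS)
print(normalized["schema:additionalType"])
```

Values that are already lists are left unchanged, so the function can be applied more than once without nesting arrays.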

Step 1: Frame the JSON-LD Document

Use a JSON-LD processor to apply CDIF-frame-2026.jsonld to your metadata document.

Step 2: Validate Against Schema

Validate the framed output against the appropriate schema:

CDIFDiscoverySchema.json -- discovery profile only
CDIFDataDescriptionSchema.json -- discovery + data description
CDIFCompleteSchema.json -- discovery + data description + archive + provenance (default)

Conformance-URI-Driven Validation (`ConformanceValidate.py`)

ConformanceValidate.py is a profile-agnostic validator that discovers which schemas to use from the instance document itself, rather than requiring you to specify a profile up front.

How it works

Reads the instance JSON-LD document.
Extracts every dcterms:conformsTo URI from schema:subjectOf entries that are tagged schema:additionalType: dcat:CatalogRecord. Entries without the CatalogRecord tag (e.g. a related publication carrying its own conformsTo) are skipped — they don't declare profile conformance for the parent dataset. URIs with an ada: (extension) prefix are ignored.
For each URI, resolves the JSON Schema and SHACL rules from the selected source (--source):
- w3id (default) — fetches <URI>/schema and <URI>/shacl from the https://w3id.org/cdif/ redirector (authoritative; needs network).
- local — looks the URI up in a JSON map file (--schema-map, default conformance-schema-map.json beside the script) that points at local framed-tree schemas + SHACL shapes. Works offline. URIs absent from the map report as no_schema/no_shacl rather than failing.
Frames + compacts the document with the CDIF output context (re-wrapping schema:propertyID, schema:additionalType, etc. into arrays where the schemas expect them).
Validates against each profile's schema (and optionally SHACL) and prints per-profile pass/fail with attributed error messages.
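
The URI discovery step described above can be sketched as follows. This is an illustration only, assuming the document uses the compact prefixed keys shown in this README; the helper names `as_list` and `extract_conformance_uris` are hypothetical and the actual script may work differently (for example on expanded JSON-LD).

```python
def as_list(value):
    """Treat a missing value as empty and a single value as a one-item list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def extract_conformance_uris(doc):
    """Collect conformsTo URIs from subjectOf entries tagged dcat:CatalogRecord."""
    uris = []
    for record in as_list(doc.get("schema:subjectOf")):
        if not isinstance(record, dict):
            continue
        tags = as_list(record.get("schema:additionalType"))
        if "dcat:CatalogRecord" not in tags:
            continue  # e.g. a related publication carrying its own conformsTo
        for ref in as_list(record.get("dcterms:conformsTo")):
            uri = ref.get("@id") if isinstance(ref, dict) else ref
            if isinstance(uri, str) and not uri.startswith("ada:"):
                uris.append(uri)
    return uris


sample = {
    "schema:subjectOf": [
        {
            "schema:additionalType": ["dcat:CatalogRecord"],
            "dcterms:conformsTo": [
                {"@id": "https://w3id.org/cdif/discovery/1.0"},
                {"@id": "ada:exampleExtensionProfile"},  # made-up extension URI
            ],
        },
        {
            # Not a catalog record, so its conformsTo is skipped.
            "dcterms:conformsTo": [{"@id": "https://example.org/other-profile"}],
        },
    ]
}
found = extract_conformance_uris(sample)
print(found)
```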

The input may be a single file (per-profile report) or a directory (batch mode — per-file PASS/FAIL lines plus an aggregate summary and an error-pattern histogram). The validation engine is also importable: run_conformance(doc, resolver, ...) returns a JSON-serializable results dict (conformsTo, profiles[].schema/shacl.status/errors, total_violations) so a web application can call it directly and pick the resolver (W3idResolver or LocalResolver) per request.

Local schema map

conformance-schema-map.json maps conformsTo URIs to local schema/SHACL files (paths relative to the map file; shacl optional; trailing-slash and datadescription/data_description insensitive):

{
  "https://w3id.org/cdif/discovery/1.0": {
    "schema": "CDIFDiscoverySchema.json",
    "shacl":  "ShaclValidation/CDIF-Discovery-Shapes.ttl"
  }
}

The shipped map covers the three profiles that have framed-tree schemas in this repo (discovery, data_description, complete). Other conformance URIs (core, manifest, provenance, …) resolve as no_schema in local mode — use --source w3id for the authoritative, complete set, or extend the map.
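
A lookup that tolerates a trailing slash and either spelling of the data description segment might look like the sketch below. The function names and the in-memory `SAMPLE_MAP` are assumptions for illustration; the shipped map file and the resolver in ConformanceValidate.py may differ in detail.

```python
def normalize_uri(uri):
    """Drop a trailing slash and unify the two data description spellings."""
    return uri.rstrip("/").replace("datadescription", "data_description")


def lookup(schema_map, uri):
    """Return the map entry for a conformsTo URI, or None if it is absent."""
    normalized = {normalize_uri(key): entry for key, entry in schema_map.items()}
    return normalized.get(normalize_uri(uri))


# Illustrative map contents; "shacl" is optional, so the second entry omits it.
SAMPLE_MAP = {
    "https://w3id.org/cdif/discovery/1.0": {
        "schema": "CDIFDiscoverySchema.json",
        "shacl": "ShaclValidation/CDIF-Discovery-Shapes.ttl",
    },
    "https://w3id.org/cdif/data_description/1.0": {
        "schema": "CDIFDataDescriptionSchema.json",
    },
}

entry = lookup(SAMPLE_MAP, "https://w3id.org/cdif/datadescription/1.0/")
missing = lookup(SAMPLE_MAP, "https://w3id.org/cdif/core/1.0")
print(entry, missing)
```

A `None` result corresponds to the no_schema/no_shacl case described above: the URI is reported rather than treated as a failure.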

Quick start

# Default (w3id source): both passes, cache fetched artifacts
python ConformanceValidate.py myrecord.jsonld --cache-dir .cache

# Local source (offline) — uses conformance-schema-map.json
python ConformanceValidate.py myrecord.jsonld --source local

# Local source with an explicit map
python ConformanceValidate.py myrecord.jsonld --source local --schema-map mymap.json

# Verbose — show every URI it discovers and every resolve
python ConformanceValidate.py myrecord.jsonld --verbose

# JSON Schema only (skip SHACL)
python ConformanceValidate.py myrecord.jsonld --no-shacl

# Batch: validate a whole directory, summary only
python ConformanceValidate.py ./testJSONMetadata --source local --summary

# (w3id only) Accept-header content negotiation on the bare URI instead of
# the /schema and /shacl sub-paths
python ConformanceValidate.py myrecord.jsonld --use-accept

Output

Per-profile sections list violations like:

======================================================================
Profile: https://w3id.org/cdif/data_description/1.0
======================================================================

  JSON Schema: PASSED

  SHACL: 2 violation(s)
    - cdif:InstanceVariable missing required cdif:name  [path=cdif:name, focus=#var1]
    - ...

A final summary reports total violations across all profiles. Exit code is 0 if no violations, 1 otherwise.

Dependencies

pip install pyld jsonschema pyshacl requests

Differences from `FrameAndValidate.py`

Aspect	`FrameAndValidate.py`	`ConformanceValidate.py`
Profile selection	User picks via `--schema` flag	Discovered from the document's `dcterms:conformsTo`
Validation count	One profile per run	All profiles the document claims
Schema source	Local file path	Fetched from `<URI>/schema` via w3id (default), or read from the local map with `--source local`
SHACL	Not built-in	Fetched from `<URI>/shacl` (or read from the local map), validated via pyshacl
Use case	One profile, deep	Cross-profile sweep / conformance check

When you know exactly which profile you want, use the per-profile FrameAndValidate.py in each release repo (they have profile-specific ARRAY_PROPERTIES lists tuned for that profile's idioms). When you want to ask "what does this document actually conform to, and how well?", use ConformanceValidate.py.

RO-Crate Conversion and Validation

RO-Crate conversion and validation tools (ConvertToROCrate.py, ValidateROCrate.py) have been moved to the CDIF packaging repository. These tools convert nested/compacted CDIF JSON-LD into RO-Crate 1.1 form via JSON-LD expand + flatten.

See the packaging repository documentation for conversion details, validation checks, and usage.

Croissant Conversion

Two converters live in croissant/: a forward converter (CDIF → Croissant) and an inverse converter (Croissant → CDIF DataDescription / Discovery).

The converters target Croissant 1.1 (http://mlcommons.org/croissant/1.1) and the current cdif: CDIF schema; the inverse accepts Croissant 1.0 or 1.1.

# Forward: CDIF -> Croissant 1.1
python croissant/ConvertToCroissant.py input.jsonld -o output-croissant.json
python -c "import mlcroissant as mlc; mlc.Dataset(jsonld='output-croissant.json')"  # optional

# Inverse: Croissant -> CDIF DataDescription / Discovery
python croissant/ConvertFromCroissant.py input-croissant.json -o output.jsonld
# then validate against the current Discovery / DataDescription profile schema

The inverse is lossy — Croissant carries no equivalents for prov:wasGeneratedBy, dqv:hasQualityMeasurement, schema:measurementTechnique, schema:spatialCoverage/temporalCoverage, the CSVW table block, or the Data Structure component roles. The script preserves anything the forward converter passed through verbatim, reconstructs schema:identifier from a DOI in citeAs/url, and maps cr:RecordSet.key → cdif:hasPrimaryKey. If the source Croissant has no recordSet, the output validates against the Discovery schema rather than the DataDescription schema (the appropriate profile in that case).

See croissant/README.md for detailed documentation on both converters, property mappings, and example output. The full property-by-property mappings are in croissant/CDIFtoCroissant.md (forward) and croissant/CroissantToCDIF.md (inverse).

Usage Examples

Command Line (Recommended)

The FrameAndValidate.py script handles the complete workflow:

# Validate with 2026 schema (default)
python FrameAndValidate.py my-metadata.jsonld -v

# Save framed output
python FrameAndValidate.py my-metadata.jsonld -o framed.json -v

# Use legacy schema
python FrameAndValidate.py my-metadata.jsonld --frame archive/CDIF-frame.jsonld --schema archive/CDIF-JSONLD-schema-schemaprefix.json -v

Options:

-v, --validate - Validate against JSON Schema
-o, --output FILE - Save framed output to file
--schema FILE - Path to JSON Schema (default: CDIFCompleteSchema.json)
--frame FILE - Path to JSON-LD frame (default: CDIF-frame-2026.jsonld)

oXygen XML Editor

The validate-cdif.bat script enables validation from within oXygen XML Editor.

Setup

Go to Tools → External Tools → Configure...
Click New and configure:

Field	Value
Name	`CDIF Validate`
Command	Path to `validate-cdif.bat`
Arguments	`"${cf}"`
Working directory	(leave empty)

Usage

Open a JSON-LD file in oXygen
Go to Tools → External Tools → CDIF Validate
Results appear in the oXygen console

Batch Script Options

validate-cdif.bat file.jsonld           # Validate with 2026 schema
validate-cdif.bat file.jsonld --framed  # Validate + save framed output
validate-cdif.bat file.jsonld --legacy  # Use pre-2026 schema
validate-cdif.bat --help                # Show help

Python

import json
from pyld import jsonld
import jsonschema

# Load the frame
with open('CDIF-frame-2026.jsonld') as f:
    frame = json.load(f)

# Load your JSON-LD metadata document
with open('my-metadata.jsonld') as f:
    doc = json.load(f)

# Load the schema
with open('CDIFCompleteSchema.json') as f:
    schema = json.load(f)

# Step 1: Frame the document
framed = jsonld.frame(doc, frame)

# Step 2: Validate against schema
try:
    jsonschema.validate(instance=framed, schema=schema)
    print("Validation successful!")
except jsonschema.ValidationError as e:
    print(f"Validation failed: {e.message}")

Required packages:

pip install PyLD jsonschema

JavaScript/Node.js

const jsonld = require('jsonld');
const Ajv = require('ajv');
const addFormats = require('ajv-formats');
const fs = require('fs');

async function validateCDIF(metadataPath) {
    // Load files
    const frame = JSON.parse(fs.readFileSync('CDIF-frame-2026.jsonld', 'utf8'));
    const doc = JSON.parse(fs.readFileSync(metadataPath, 'utf8'));
    const schema = JSON.parse(fs.readFileSync('CDIFCompleteSchema.json', 'utf8'));

    // Step 1: Frame the document
    const framed = await jsonld.frame(doc, frame);

    // Step 2: Validate against schema
    const ajv = new Ajv({ allErrors: true });
    addFormats(ajv);
    const validate = ajv.compile(schema);

    if (validate(framed)) {
        console.log('Validation successful!');
        return true;
    } else {
        console.log('Validation failed:', validate.errors);
        return false;
    }
}

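// Report I/O or framing errors instead of leaving the promise rejection unhandled
validateCDIF('my-metadata.jsonld').catch((err) => {
    console.error('Validation could not run:', err);
    process.exitCode = 1;
});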

Required packages:

npm install jsonld ajv ajv-formats

Context Requirements

Your JSON-LD metadata documents must include a @context with namespace prefixes. Only schema and dcterms are required at the discovery level; additional prefixes are needed depending on which optional properties are used.

2026 Schema Requirements

Required (discovery level):

{
    "@context": {
        "schema": "http://schema.org/",
        "dcterms": "http://purl.org/dc/terms/"
    }
}

Optional prefixes (add as needed for the properties you use):

Prefix	IRI	When needed
`spdx`	`http://spdx.org/rdf/terms#`	Checksum properties on distributions
`dcat`	`http://www.w3.org/ns/dcat#`	`dcat:CatalogRecord` on subjectOf
`geosparql`	`http://www.opengis.net/ont/geosparql#`	Spatial coverage geometry
`prov`	`http://www.w3.org/ns/prov#`	Provenance (wasGeneratedBy)
`dqv`	`http://www.w3.org/ns/dqv#`	Data quality measurements
`cdi`	`http://ddialliance.org/Specification/DDI-CDI/1.0/RDF/`	DDI-CDI variable/data structure properties
`csvw`	`http://www.w3.org/ns/csvw#`	CSVW tabular data properties (data description level)

Domain-specific metadata may also use extension namespace prefixes. For example, the XAS (X-ray absorption spectroscopy) test example uses:

Prefix	IRI	Purpose
`xas`	`http://cdi4exas.org/`	XAS-specific types and properties (beamline, detector, edge energy, etc.)
`cdifq`	`http://crossdomaininteroperability.org/cdifq/`	Placeholder namespace for data structure properties (`nColumns`, `nRows`) not yet assigned to a formal vocabulary

The cdifq namespace is a temporary placeholder. Properties using it (such as row/column counts on data structures) may migrate to DDI-CDI, CSVW, or another standard vocabulary in the future. croissant/ConvertToCroissant.py includes cdifq in its output context so that these terms resolve correctly during JSON-LD processing.

Legacy Schema Requirements

{
    "@context": {
        "schema": "http://schema.org/",
        "dcterms": "http://purl.org/dc/terms/",
        "prov": "http://www.w3.org/ns/prov#",
        "dqv": "http://www.w3.org/ns/dqv#",
        "geosparql": "http://www.opengis.net/ont/geosparql#",
        "spdx": "http://spdx.org/rdf/terms#",
        "time": "http://www.w3.org/2006/time#"
    }
}

Authoring Instances Without Prefixes

If you prefer to author metadata without namespace prefixes (e.g., name instead of schema:name), you can use the CDIF-context-2026.jsonld context file. This context maps unprefixed property names to their full IRIs.

Example Instance Without Prefixes

{
    "@context": "https://your-server.org/CDIF-context-2026.jsonld",
    "@type": "Dataset",
    "@id": "https://example.org/dataset/123",
    "name": "My Dataset",
    "description": "A sample dataset description",
    "identifier": "dataset-123",
    "dateModified": "2024-01-15",
    "url": "https://example.org/data/123",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "subjectOf": {
        "@type": ["Dataset"],
        "additionalType": ["dcat:CatalogRecord"],
        "sdDatePublished": "2024-01-15"
    }
}

How It Works

The validation workflow handles both prefixed and unprefixed instances:

Unprefixed instance references CDIF-context-2026.jsonld
Framing with CDIF-frame-2026.jsonld transforms the instance
The frame's context uses prefixed names, so the output has prefixed keys
Validate against CDIFCompleteSchema.json

This means you only need one schema. The framing step normalizes all instances to the prefixed format regardless of how they were authored.

Deploying the Context

For production use, host CDIF-context-2026.jsonld at a stable URL and reference it in your instances:

{
    "@context": "https://your-server.org/CDIF-context-2026.jsonld",
    ...
}

Or embed the context directly in your instance by copying the contents of CDIF-context-2026.jsonld.

Schema Structure

The schema validates CDIF Discovery profile metadata with the following required fields:

@id - Resource identifier
@type - Must include schema:Dataset
@context - JSON-LD context with required prefixes
schema:name - Resource name
schema:identifier - Primary identifier
schema:dateModified - Last modification date
schema:subjectOf - Metadata about the metadata record (requires @type containing schema:Dataset and schema:additionalType containing dcat:CatalogRecord)
Either schema:url or schema:distribution - Access information
Either schema:license or schema:conditionsOfAccess - Usage terms
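
Before running full JSON Schema validation, the required fields listed above can be checked with a few lines of Python. This is a rough pre-check under the assumption that the document has already been framed to prefixed keys; it does not replace the schema, and the names `REQUIRED`, `ONE_OF`, and `missing_required` are hypothetical.

```python
REQUIRED = [
    "@id", "@type", "@context", "schema:name", "schema:identifier",
    "schema:dateModified", "schema:subjectOf",
]
# At least one property from each pair must be present.
ONE_OF = [
    ("schema:url", "schema:distribution"),
    ("schema:license", "schema:conditionsOfAccess"),
]


def missing_required(framed):
    """List required discovery-level fields that a framed document lacks."""
    problems = [key for key in REQUIRED if key not in framed]
    for first, second in ONE_OF:
        if first not in framed and second not in framed:
            problems.append(f"{first} or {second}")
    types = framed.get("@type", [])
    if isinstance(types, str):
        types = [types]
    if "schema:Dataset" not in types:
        problems.append("@type must include schema:Dataset")
    return problems


framed = {
    "@context": {"schema": "http://schema.org/", "dcterms": "http://purl.org/dc/terms/"},
    "@id": "https://example.org/dataset/123",
    "@type": ["schema:Dataset"],
    "schema:name": "My Dataset",
    "schema:identifier": "dataset-123",
    "schema:dateModified": "2024-01-15",
    "schema:subjectOf": {"@type": ["schema:Dataset"]},
    "schema:url": "https://example.org/data/123",
}
problems = missing_required(framed)
print(problems)
```

The sample omits both usage-terms properties on purpose, so the check reports that pair and nothing else. Nested requirements, such as the contents of schema:subjectOf, are left to the schema.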

2026 Schema Additions

The 2026 schema adds support for:

Variables (schema:variableMeasured):

Items are anyOf PropertyValue-based (cdifVariableMeasured) or schema:StatisticalVariable
PropertyValue variables: typed as schema:PropertyValue with DDI-CDI extensions (cdi:intendedDataType, cdif:simpleUnitOfMeasure, cdi:describedUnitOfMeasure, cdif:uses, cdif:role)
cdif:role -- enum: UnitIdentifier, Measure, Attribute, Dimension, Descriptor, ReferenceVariable
StatisticalVariable: typed as schema:StatisticalVariable with schema:statType, schema:measuredProperty (required)
cdif:physicalDataType is required at the data description level (CDIFDataDescription/CDIFcomplete profiles), not at discovery level

Distributions:

cdi:StructuredDataSet - For structured formats (JSON, XML, HDF5, NetCDF)
cdi:TabularTextDataSet - For tabular text (wide format) with CSVW properties:
- csvw:delimiter, csvw:header, csvw:headerRowCount
- cdi:isDelimited OR cdi:isFixedWidth
- cdif:hasPhysicalMapping - Links variables to physical representation
cdi:LongStructureDataSet - For long/narrow data format where each row is a single observation:
- A descriptor column identifies which variable each row measures (cdif:role: Descriptor)
- A reference column holds the actual value (cdif:role: ReferenceVariable)
- Optional CSVW properties (delimiter, header, etc.) and DDI-CDI physical properties
- cdif:hasPhysicalMapping - Links variables to physical representation
- The detailed LongDataStructure component cardinality (exactly one Identifier/VariableDescriptor/VariableValue component) is defined in the cdifDataStructure profile (data_structure/1.1), enforced in both JSON Schema (minContains/maxContains) and SHACL
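
To make the wide versus long distinction concrete, the sketch below pivots a small wide table into long form. The column names `variable` and `value` and the sample data are invented for illustration: `variable` plays the descriptor role (cdif:role: Descriptor) and `value` plays the reference role (cdif:role: ReferenceVariable).

```python
def wide_to_long(rows, id_column):
    """Emit one long-format row per (identifier, variable) pair."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append(
                {id_column: row[id_column], "variable": column, "value": value}
            )
    return long_rows


# Wide format: one row per sample, one column per measured variable.
wide = [
    {"sample": "S1", "nitrate": 1.2, "phosphate": 0.3},
    {"sample": "S2", "nitrate": 0.8, "phosphate": 0.5},
]
long_rows = wide_to_long(wide, "sample")
for row in long_rows:
    print(row)
```

Each long row holds a single observation, which is why the data structure needs a descriptor column to say which variable the value belongs to.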

Flattened Graph Schema

CDIF-graph-schema-2026.json is the graph-based counterpart to the framed tree schema. It validates flattened JSON-LD documents that use @graph arrays directly, without requiring framing first. This is useful for validating JSON-LD as it naturally comes out of RDF stores or JSON-LD flatten operations.

The schema is generated by generate_graph_schema.py from the CDIF building block source schemas.

Building Block Sources

The generator reads building block schemas from the metadataBuildingBlocks/_sources/ directory (the BuildingBlockSubmodule). The location is auto-detected or can be overridden:

# Auto-detect (looks for BuildingBlockSubmodule/_sources/ relative to script)
python generate_graph_schema.py

# Explicit path
python generate_graph_schema.py --bb-dir /path/to/_sources

# Environment variable
export CDIF_BB_DIR=/path/to/_sources
python generate_graph_schema.py

# Custom output path
python generate_graph_schema.py --output my-graph-schema.json

Graph Schema Usage

# Validate a flattened JSON-LD document directly
python -c "
import json, jsonschema
with open('CDIF-graph-schema-2026.json') as f: schema = json.load(f)
with open('my-flattened.jsonld') as f: doc = json.load(f)
jsonschema.validate(doc, schema)
print('Valid')
"

The graph schema accepts three input forms:

A {"@context": {...}, "@graph": [...]} document (the primary use case)
A bare array of typed objects
A single typed object
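
When inspecting a document before validation, it can help to reduce the three accepted forms to a plain list of nodes. The helper below is a small sketch of that idea; `graph_nodes` is a hypothetical name and is not part of the repository.

```python
def graph_nodes(doc):
    """Return the node list for any of the three accepted input forms."""
    if isinstance(doc, dict) and "@graph" in doc:
        return list(doc["@graph"])  # {"@context": ..., "@graph": [...]}
    if isinstance(doc, list):
        return list(doc)  # bare array of typed objects
    if isinstance(doc, dict) and "@type" in doc:
        return [doc]  # single typed object
    raise ValueError("not one of the three accepted input forms")


dataset = {"@id": "https://example.org/dataset/123", "@type": "schema:Dataset"}
from_graph = graph_nodes({"@context": {}, "@graph": [dataset]})
from_array = graph_nodes([dataset])
from_object = graph_nodes(dataset)
print(len(from_graph), len(from_array), len(from_object))
```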

Schema Structure (Graph)

The generated schema has this high-level structure:

root-graph: validates @context prefix declarations + @graph array of nodes
root-object: a nested if/then/else chain dispatching objects by @type to the correct type definition
id-reference: shared {"@id": "string"} definition for cross-node references
24 type definitions: type-Dataset, type-Person, type-Organization, type-PropertyValue, type-DefinedTerm, type-CreativeWork, type-DataDownload, type-MediaObject, type-WebAPI, type-Action, type-HowTo, type-Place, type-ProperInterval, type-MonetaryGrant, type-Role, type-Activity, type-QualityMeasurement, type-Claim, type-CatalogRecord, type-Identifier, type-InstanceVariable, type-StructuredDataSet, type-TabularTextDataSet, type-LongStructureDataSet

Type dispatch is ordered most-specific-first (e.g., cdi:StructuredDataSet before schema:Dataset) so that subtypes are matched before their parent types.
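
The ordering matters because an if/then/else chain stops at the first match. The sketch below builds such a chain from an ordered list; it illustrates the technique only and is not taken from generate_graph_schema.py. The function name, the exact `if` condition, and the final `false` fallback are assumptions.

```python
import json


def build_dispatch(ordered_types):
    """Nest if/then/else clauses so earlier entries are tested first."""
    schema = False  # assumed fallback: reject objects that match no type
    for type_name, definition in reversed(ordered_types):
        type_matches = {
            "anyOf": [
                {"const": type_name},
                {"type": "array", "contains": {"const": type_name}},
            ]
        }
        schema = {
            "if": {"properties": {"@type": type_matches}, "required": ["@type"]},
            "then": {"$ref": f"#/$defs/{definition}"},
            "else": schema,
        }
    return schema


# Most specific first, following the example in the text above.
chain = build_dispatch([
    ("cdi:StructuredDataSet", "type-StructuredDataSet"),
    ("schema:Dataset", "type-Dataset"),
])
print(json.dumps(chain, indent=2))
```

Reversing the two entries would make any node typed with both values validate as type-Dataset, and the more specific definition would never be reached.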

Key Transformations

The generator applies these transformations when reading building block source schemas:

External $ref resolution -- Cross-building-block $refs (e.g., ../person/schema.yaml) are resolved to internal #/$defs/type-X references
anyOf alternatives -- Properties that reference other building block types get anyOf [type-ref, id-reference] so they accept either inline objects or @id cross-references
@type disambiguation -- Composite types get additional type markers for dispatch (e.g., cdifCatalogRecord becomes dcat:CatalogRecord, identifier adds cdi:Identifier)
@context stripping -- Context declarations are removed from non-root types (the @context goes on the root-graph wrapper only)
Composite type assembly -- Complex types like type-Dataset merge mandatory + optional building blocks; type-StructuredDataSet/type-TabularTextDataSet/type-LongStructureDataSet compose dataDownload + CDI extensions
Extended provenance -- type-Activity built from cdifProv building block, requiring multi-typed @type: ["schema:Action", "prov:Activity"], merging base generatedBy properties (prov:used) with schema.org Action properties (schema:agent, schema:actionProcess, etc.). Instruments are nested within prov:used items via schema:instrument sub-key (instruments are prov:Entity subclasses). type-HowTo and type-Claim added as new dispatch types for methodology and assertion objects
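
The first two transformations can be sketched together: resolve a cross-building-block `$ref` to an internal definition, then allow an `@id` reference as an alternative. This is a simplified illustration, not the generator's code; the `BB_TYPES` lookup table and both function names are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical lookup from building block directory name to type definition.
BB_TYPES = {"person": "type-Person", "organization": "type-Organization"}

# Shared definition for cross-node references, i.e. {"@id": "<string>"}.
ID_REFERENCE = {
    "type": "object",
    "properties": {"@id": {"type": "string"}},
    "required": ["@id"],
}


def resolve_ref(ref):
    """Turn a cross-building-block $ref into an internal $defs reference."""
    block = PurePosixPath(ref).parent.name
    return f"#/$defs/{BB_TYPES[block]}"


def allow_id_reference(internal_ref):
    """Accept either an inline object of the type or an @id cross-reference."""
    return {"anyOf": [{"$ref": internal_ref}, {"$ref": "#/$defs/id-reference"}]}


internal = resolve_ref("../person/schema.yaml")
creator_schema = allow_id_reference(internal)
print(creator_schema)
```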

Troubleshooting

Common Validation Errors

Missing required property
- Ensure all required fields are present
- Check that schema:subjectOf contains required nested fields
Type mismatch
- Properties like schema:spatialCoverage and schema:temporalCoverage expect arrays
- Check that @type values use the schema: prefix
Invalid @type
- Root @type must include schema:Dataset
- For 2026 schema, variables must include both schema:PropertyValue and cdi:InstanceVariable
Framing issues
- Ensure your document has proper @id values for node references
- Check that the @context is compatible with the frame
dcterms:conformsTo syntax
- Must use object syntax: [{"@id": "..."}] not ["..."]
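
If existing records use plain strings for dcterms:conformsTo, a small helper can rewrite them into object syntax before validation. The function below is a sketch with a hypothetical name, `normalize_conforms_to`; it leaves values that are already objects untouched.

```python
def normalize_conforms_to(value):
    """Rewrite conformsTo entries as a list of {"@id": ...} objects."""
    items = value if isinstance(value, list) else [value]
    return [item if isinstance(item, dict) else {"@id": item} for item in items]


record = {"dcterms:conformsTo": ["https://w3id.org/cdif/discovery/1.0"]}
record["dcterms:conformsTo"] = normalize_conforms_to(record["dcterms:conformsTo"])
print(record)
```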

Debugging

To see the framed output before validation:

python FrameAndValidate.py my-metadata.jsonld -o framed.json

Or in Python:

framed = jsonld.frame(doc, frame)
print(json.dumps(framed, indent=2))

SHACL Validation

In addition to JSON Schema validation, CDIF metadata can be validated using SHACL (Shapes Constraint Language) rules. SHACL validation operates on the RDF graph and can express constraints that JSON Schema cannot -- SPARQL-based targeting, cross-node relationships, and semantic inference.

The composite SHACL shapes are compiled from modular rules.shacl files in individual building blocks in the metadataBuildingBlocks repository. Two profiles are available:

discovery — ShaclValidation/CDIF-Discovery-Shapes.ttl (64 shapes)
complete — ShaclValidation/CDIF-Complete-Shapes.ttl (76 shapes, adds provenance + data description)

Quick start:

# Validate against discovery shapes
python ShaclValidation/ShaclJSONLDContext.py my-metadata.jsonld ShaclValidation/CDIF-Discovery-Shapes.ttl

# Generate a markdown validation report
python ShaclValidation/generate_shacl_report.py my-metadata.jsonld ShaclValidation/CDIF-Complete-Shapes.ttl -o report.md

# Regenerate shapes after building block changes
python ShaclValidation/generate_shacl_shapes.py --profile discovery
python ShaclValidation/generate_shacl_shapes.py --profile complete

See ShaclValidation/README.md for detailed documentation on the SHACL tools, shapes architecture, report format, and how to add new building block shapes.

Recommendation: Use both JSON Schema and SHACL validation for comprehensive coverage. batch_validate.py runs both automatically across multiple file groups.

GeoCodes Harvester

geocodes_harvester.py harvests dataset metadata from the EarthCube GeoCodes catalog (~170K indexed datasets). It queries the Blazegraph SPARQL endpoint, fetches original JSON-LD from source landing pages when available, and optionally converts records to CDIF profile format.

# List publishers and dataset counts
python geocodes_harvester.py --list-publishers

# Harvest 5 records from diverse publishers, convert to CDIF Discovery
python geocodes_harvester.py --count 5 --output ./examples --cdif discovery

# Harvest from a specific publisher
python geocodes_harvester.py --publisher "PANGAEA" --count 3 --output ./examples

# Harvest without CDIF conversion (raw schema.org JSON-LD)
python geocodes_harvester.py --count 5 --output ./raw-examples

The CDIF conversion handles: property prefixing (schema:), @context/@type normalization, @list wrapping for creators, distribution fixes, subjectOf with conformsTo, type mappings (FundingAgency to Organization, Grant to MonetaryGrant, Croissant sc:Dataset to Dataset), Person name synthesis, and sameAs array normalization. All conversions are documented in each record's subjectOf description. Extra properties from the source are preserved (open-world assumption).

DCAT Conversion

DCAT/dcat_to_cdif.py converts DCAT JSON-LD catalogs or individual dataset records to CDIF-conformant schema.org JSON-LD. Maps DCAT/Dublin Core properties to schema.org equivalents per the CDIF DCAT implementation guide.

# List datasets in a DCAT catalog
python DCAT/dcat_to_cdif.py catalog.jsonld --list

# Convert selected records, validate output
python DCAT/dcat_to_cdif.py catalog.jsonld --output ./examples \
  --select 0,3,5 --catalog-name "My Catalog" --catalog-url "https://example.org/" \
  --validate

Key mappings: dcterms:title → schema:name, dcterms:description → schema:description, dcterms:modified → schema:dateModified, dcterms:license → schema:license, dcterms:accessRights → schema:conditionsOfAccess, dcat:keyword → schema:keywords, dcat:Distribution → schema:DataDownload, dcterms:spatial → schema:spatialCoverage, dcterms:temporal → schema:temporalCoverage. Unmapped properties preserved (open world). Auto-detects Discovery vs Core profile based on spatial/temporal content.
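
The property renames in that list amount to a lookup table with a pass-through for anything unmapped. The sketch below shows the idea for top-level properties only. It is not the converter's implementation: the names `KEY_MAP` and `map_properties` are hypothetical, and the dcat:Distribution → schema:DataDownload mapping is omitted because it is a type mapping rather than a property rename.

```python
KEY_MAP = {
    "dcterms:title": "schema:name",
    "dcterms:description": "schema:description",
    "dcterms:modified": "schema:dateModified",
    "dcterms:license": "schema:license",
    "dcterms:accessRights": "schema:conditionsOfAccess",
    "dcat:keyword": "schema:keywords",
    "dcterms:spatial": "schema:spatialCoverage",
    "dcterms:temporal": "schema:temporalCoverage",
}


def map_properties(record):
    """Rename mapped properties and keep unmapped ones as they are."""
    return {KEY_MAP.get(key, key): value for key, value in record.items()}


dcat_record = {
    "dcterms:title": "Example dataset",
    "dcat:keyword": ["water", "quality"],
    "dcat:theme": "https://example.org/themes/environment",  # unmapped
}
converted = map_properties(dcat_record)
print(converted)
```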

See DCAT/README.md for the full property mapping table, PSDI catalog example, and known limitations.

DDI Conversion

DDI/ddi_to_cdif.py converts a DDI Codebook 2.5 XML file (for example, a DDI export from Harvard Dataverse) to a CDIF DataDescription JSON-LD document.

# Convert a DDI XML export (DOI is required)
python DDI/ddi_to_cdif.py input.xml --doi https://doi.org/10.7910/DVN/XXXXXX -o output.json

# Also fetch tab-file headers and file size/checksum from the Dataverse API
python DDI/ddi_to_cdif.py input.xml --doi https://doi.org/10.7910/DVN/XXXXXX \
  --fetch-headers --fetch-file-meta -o output.json

Key mappings: study titl/abstract → schema:name/schema:description, authors → schema:creator, keyword → schema:keywords, spatial/temporal coverage; DDI <var> → schema:variableMeasured (cdi:InstanceVariable); DDI <fileDscr> → schema:DataDownload (cdi:TabularTextDataSet) with CSVW properties; tab-file headers → physical mappings; caseQnty/varQnty → cdifq:nRows/nColumns. --fetch-headers and --fetch-file-meta pull headers and file size/checksum from the Dataverse API.
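
To make the shape of the mapping concrete, the following sketch extracts the study title and variables from a tiny DDI Codebook fragment using only the standard library. It is not the code in DDI/ddi_to_cdif.py. The sample XML, the function names, the use of `labl` as `schema:description`, and the exact shape of the output nodes are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Minimal, made-up DDI Codebook fragment for illustration only.
SAMPLE = """<codeBook xmlns="ddi:codebook:2_5">
  <stdyDscr><citation><titlStmt><titl>Example Study</titl></titlStmt></citation></stdyDscr>
  <dataDscr>
    <var name="age"><labl>Age of respondent</labl></var>
    <var name="income"><labl>Household income</labl></var>
  </dataDscr>
</codeBook>"""


def local_name(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def ddi_to_skeleton(xml_text):
    """Build a small CDIF-like dict with the title and one node per <var>."""
    root = ET.fromstring(xml_text)
    title = None
    variables = []
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "titl" and title is None:
            title = (elem.text or "").strip()
        elif name == "var":
            node = {
                "@type": ["cdi:InstanceVariable"],
                "schema:name": elem.get("name"),
            }
            for child in elem:
                if local_name(child.tag) == "labl" and child.text:
                    # Assumption: the variable label is used as the description.
                    node["schema:description"] = child.text.strip()
            variables.append(node)
    return {"schema:name": title, "schema:variableMeasured": variables}
```

Matching on local names keeps the sketch independent of the namespace declaration in the input file.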

Note: The converter currently emits the pre-migration cdi:-prefixed data-structure properties (cdi:role, cdi:physicalDataType, cdi:hasPhysicalMapping, cdi:index, cdi:formats_InstanceVariable). These should be migrated to the current cdif: prefix before the output is treated as current-schema CDIF.
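
Until the converter is updated, the migration could be applied as a post-processing step along these lines. This is a sketch under one loud assumption: each property keeps its local name and only the prefix changes from `cdi:` to `cdif:`. Check the current schema before relying on that. The function name `migrate_prefixes` is hypothetical.

```python
# Properties named in the note above. Class names such as cdi:InstanceVariable
# are deliberately left untouched.
LEGACY_PROPERTIES = {
    "cdi:role",
    "cdi:physicalDataType",
    "cdi:hasPhysicalMapping",
    "cdi:index",
    "cdi:formats_InstanceVariable",
}


def migrate_prefixes(node):
    """Return a copy of node with the listed cdi: property keys renamed to cdif:.

    Assumption: the local name is unchanged, only the prefix differs.
    """
    if isinstance(node, list):
        return [migrate_prefixes(item) for item in node]
    if isinstance(node, dict):
        migrated = {}
        for key, value in node.items():
            if key in LEGACY_PROPERTIES:
                key = "cdif:" + key[len("cdi:"):]
            migrated[key] = migrate_prefixes(value)
        return migrated
    return node
```

The function only renames keys, so `@type` values and any other `cdi:` terms outside the listed set pass through unchanged.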

MetadataExamples

The MetadataExamples/ directory contains sample CDIF JSON-LD documents for testing:

File	Technique	Description
`tof-htk9-f770.json`	ToF-SIMS	Time-of-flight mass spectrometry particle analysis
`xrd-2j0t-gq80.json`	XRD	X-ray diffraction
`xanes-2arx-b516.json`	XANES	X-ray absorption near-edge structure
`yv1f-jb20.json`	--	General dataset
`test_se_na2so4-testschemaorg-cdiv3.json`	XAS	X-ray absorption spectroscopy with DDI-CDI data structure (WideDataStructure, InstanceVariable, ValueMapping). Uses `xas:` and `cdifq:` extension namespaces
`nwis-water-quality-longdata.json`	Water Quality	NWIS groundwater nutrient analysis (464 rows, 20 columns) in `cdi:LongStructureDataSet` long (narrow) format with `Descriptor`/`ReferenceVariable` roles, `cdif:hasPhysicalMapping`, and 5 `Measure` domain variables. Declares `core`/`discovery`/`data_description`/`data_structure` 1.1 conformance. Validates against graph schema (`CDIF-graph-schema-2026.json`)
`prov-ocean-temp-example.json`	Ocean Temperature	Extended provenance example demonstrating `cdifProv` building block: action chaining (`schema:object`/`schema:result`), multi-typed `["schema:Action", "prov:Activity"]` activities, agents with Role wrappers, inline `schema:HowTo` methodology via `schema:actionProcess` with 3 steps, diverse instruments, facility location, and backward-compatible `prov:used`. Validates against graph schema

Corresponding Croissant output files are in the croissant/ directory.

DDI-CDI Resolved Schema

The ddi-cdi/cls-InstanceVariable-resolved.json file is a standalone JSON Schema (Draft 2020-12) for the DDI-CDI InstanceVariable class, derived from ddi-cdi/ddi-cdi.schema_normative.json. It resolves all $ref references into a self-contained schema suitable for use in editors like oXygen without needing the full 395-definition DDI-CDI schema.

The resolved schema applies several transformations to make the schema practical:

- Reverse properties removed: 767 _OF_ reverse relationship properties stripped (use JSON-LD @reverse instead)
- catalogDetails removed: catalog-level metadata omitted from all classes
- Redundant classes omitted: cls-DataPoint, cls-Datum, cls-RepresentedVariable simplified to IRI-only references
- XSD types inlined: primitive types (xsd:string, xsd:integer, etc.) replaced with inline definitions
- Patterns normalized: if/then/else array patterns converted to consistent anyOf
- Frequency-based $ref resolution: common definitions (>3 uses) kept in $defs; rare definitions inlined
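
Two of these transformations, reverse-property removal and the frequency test behind $ref resolution, can be sketched as below. This is not the generator used for the resolved schema. The function names are hypothetical, the sketch assumes references of the form `#/$defs/Name`, and it only decides which definitions would stay in $defs without performing the inlining or handling circular references.

```python
from collections import Counter


def strip_reverse_properties(schema):
    """Return a copy of the schema without properties whose name contains _OF_."""
    if isinstance(schema, list):
        return [strip_reverse_properties(item) for item in schema]
    if not isinstance(schema, dict):
        return schema
    out = {}
    for key, value in schema.items():
        if key == "properties" and isinstance(value, dict):
            value = {name: sub for name, sub in value.items() if "_OF_" not in name}
        out[key] = strip_reverse_properties(value)
    return out


def count_refs(schema, counts=None):
    """Count how often each $ref target string occurs anywhere in the schema."""
    counts = Counter() if counts is None else counts
    if isinstance(schema, dict):
        ref = schema.get("$ref")
        if isinstance(ref, str):
            counts[ref] += 1
        for value in schema.values():
            count_refs(value, counts)
    elif isinstance(schema, list):
        for item in schema:
            count_refs(item, counts)
    return counts


def partition_definitions(schema, threshold=3):
    """Split $defs names into (keep in $defs, inline) using the >threshold rule."""
    counts = count_refs(schema)
    names = schema.get("$defs", {})
    keep = sorted(n for n in names if counts["#/$defs/" + n] > threshold)
    inline = sorted(n for n in names if counts["#/$defs/" + n] <= threshold)
    return keep, inline
```

Running the strip step first matters, because references that only occur inside removed reverse properties should not count toward the usage threshold.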

See ddi-cdi/cls-InstanceVariable-resolved-README.md for full details on the generation process, circular reference analysis, and transformation rationale.

Notes

- The framed tree schemas (CDIFCompleteSchema.json, CDIFDiscoverySchema.json) are generated from building block profile resolved schemas using generate_validation_schema.py. The hand-maintained originals and the all-in-one CDIF-JSONLD-schema-2026.json are in archive/.
- Legacy schema (CDIF-JSONLD-schema-schemaprefix.json) is still available for older documents.
- All schema.org elements require the schema: prefix for SHACL validation compatibility.
- The frame ensures that after framing, the output structure matches what the JSON schema expects.
- For SHACL validation, use the corresponding .shacl or .ttl files in this repository.
- @type flexibility: All @type definitions in the framed schemas use anyOf to accept either a string ("schema:Dataset") or an array (["schema:Dataset"]). JSON-LD framing may compact single-element arrays to strings; FrameAndValidate.py recursively normalizes all @type values back to arrays.
- spdx:Checksum typing: All spdx:checksum objects must include "@type": "spdx:Checksum". This is required by both the JSON Schema (required: ["@type"]) and SHACL shapes (sh:class spdx:Checksum).
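
The last two notes can be illustrated with a short sketch. It shows one way to normalize @type values to arrays and to find spdx:checksum objects that lack the required type. It is not the code in FrameAndValidate.py, and the function names and the path format in the results are hypothetical.

```python
def normalize_types(node):
    """Recursively coerce every string-valued @type to a one-element list, in place."""
    if isinstance(node, list):
        for item in node:
            normalize_types(item)
    elif isinstance(node, dict):
        if isinstance(node.get("@type"), str):
            node["@type"] = [node["@type"]]
        for value in node.values():
            normalize_types(value)
    return node


def untyped_checksums(node, path="$"):
    """Return paths of spdx:checksum values that lack @type spdx:Checksum."""
    problems = []
    if isinstance(node, list):
        for index, item in enumerate(node):
            problems += untyped_checksums(item, f"{path}[{index}]")
    elif isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "spdx:checksum":
                candidates = value if isinstance(value, list) else [value]
                for candidate in candidates:
                    types = candidate.get("@type", []) if isinstance(candidate, dict) else []
                    if isinstance(types, str):
                        types = [types]
                    if "spdx:Checksum" not in types:
                        problems.append(child)
            problems += untyped_checksums(value, child)
    return problems
```

A check like `untyped_checksums` can be run before schema validation to produce a more direct message than a generic "required property" error.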
