
Develop -> Master merge, release v7.0.0 #282

Merged
mdorf merged 113 commits into master from develop
Apr 21, 2026

Conversation

mdorf (Member) commented Apr 21, 2026

Overview

This PR merges develop into master for the v7.0.0 release of ncbo/ontologies_linked_data.

This is a major release and includes the substantial synchronization and modernization work recently introduced into develop, including the large alignment effort with the AgroPortal codebase and the updates required for compatibility with the AgroPortal-based versions of goo and sparql-client.

Highlights

This release is primarily driven by the work introduced in:

Key changes include:

  • major synchronization of ontologies_linked_data with the AgroPortal implementation
  • support for schemaless Solr enabled by the updated GOO architecture
  • expanded triple-store backend compatibility, including Virtuoso and GraphDB
  • substantial refactoring of ontology submission processing
  • dynamic Solr schema support across several models
  • OAuth authentication and authorization improvements
  • infrastructure modernization, including Ruby 3.2, newer Minitest, and updated ActiveSupport compatibility
  • OntoPortal testkit integration and related CI/testing updates
  • BioPortal-specific improvements such as updated label generation logic

Prerequisites

This release assumes the AgroPortal-based replacements of the following repositories are already in place:

Notes

This PR is a release merge from develop into master. Version tagging and any follow-up release steps should be performed after the merge.

Post-merge

Tag master as:

  • v7.0.0

mdorf added 30 commits August 6, 2025 10:31
…copy the portal language label into the generic one
alexskr and others added 29 commits April 8, 2026 09:07
Chore: disable index_all_data by default during submission processing
This method was moved to SubmissionMetricsCalculator during a prior
refactoring but the original copy was left behind. No callers exist
in this repo or dependent projects (ontologies_api, ncbo_cron,
ncbo_annotator).
No CSV usage remains in this file after removal of metrics_for_submission.
The csv library is still required by ontology_submission.rb and
submission_metrics_calculator.rb where it is needed.
No external callers found in this repo or dependent projects.
Keeping the method for now pending further validation.
Verifies that class_count returns -1 gracefully when no metrics
exist in the triplestore and no CSV fallback is available.
The inner rescue in metrics_for_submission caught errors, logged a
minimal message, and returned nil. This masked the real error —
the caller (compute_metrics) would then fail with NoMethodError on
nil, and the outer rescue in process_metrics would log that
misleading error instead of the root cause.

process_metrics already handles errors properly: logs the real
exception with full backtrace and sets the METRICS error status.
The inner rescue was redundant and harmful.
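A minimal sketch of the failure mode described above. The class and method names here are illustrative stand-ins, not the actual ontologies_linked_data code:

```ruby
require 'logger'

# Hypothetical stand-in for the submission metrics flow.
class MetricsDemo
  def initialize(logger)
    @logger = logger
  end

  # Before: the inner rescue swallowed the root cause and returned nil,
  # so the caller later crashed with NoMethodError on nil.
  def metrics_for_submission_masked
    raise 'triplestore unreachable' # the real error
  rescue StandardError
    @logger.error('Unable to compute metrics') # minimal, misleading
    nil
  end

  # After: let the exception propagate to the outer handler.
  def metrics_for_submission
    raise 'triplestore unreachable'
  end

  # Outer handler (process_metrics in the real code): logs the real
  # exception with backtrace and sets the error status.
  def process_metrics
    metrics = metrics_for_submission
    metrics.size
  rescue StandardError => e
    @logger.error("#{e.class}: #{e.message}\n#{e.backtrace&.join("\n")}")
    :error_status_set # stands in for setting the METRICS error status
  end
end
```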
max_depth_fn was reading maxDepth from the CSV file generated by
owlapi_wrapper regardless of the flat flag. owlapi_wrapper has no
knowledge of BioPortal's flat designation, so it reports the real
tree depth. Now we short-circuit and return 0 for flat ontologies
before any CSV or SPARQL calculation.
class_count was falling back to reading metrics.csv from disk when
triplestore metrics were absent. This caused errors on API nodes
where the file does not exist or is missing for older submissions.
The API should always read metrics from the triplestore. The CSV
file should only be consumed during ontology parsing in ncbo_cron.
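The triplestore-only behavior might be sketched as follows; the method shape and `-1` sentinel follow the description above, but the signature is hypothetical:

```ruby
# Sketch: class_count reads only the triplestore metrics object and
# returns -1 when it is absent. No metrics.csv fallback on API nodes.
def class_count(metrics)
  return -1 if metrics.nil?   # no metrics in the triplestore
  metrics.fetch(:classes, -1) # never fall back to reading metrics.csv
end
```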
query_groupby_classes was called with rdfsSC=nil for flat ontologies,
producing invalid SPARQL (<> predicate). This was silently tolerated
by 4store but caused a SPARQL::Client::MalformedQuery error on
GraphDB, preventing the metrics status from being set.

The groupby_children results were already unused for flat ontologies
(the loop body was guarded by `unless is_flat`), so the query was
wasteful even when it didn't error. Moved the entire block inside
the `unless is_flat` guard.
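The guard change can be illustrated like this; `children_counts` and `query_groupby_classes` are hypothetical names standing in for the real metrics code:

```ruby
# Illustrative guard: skip the children-count query entirely for flat
# ontologies instead of issuing it with a nil predicate, which produced
# invalid SPARQL (<> predicate) and broke on GraphDB.
def children_counts(is_flat, rdfs_subclass)
  return {} if is_flat || rdfs_subclass.nil? # results were unused anyway
  query_groupby_classes(rdfs_subclass)
end

# Stand-in for the real SPARQL GROUP BY query helper.
def query_groupby_classes(_predicate)
  { 'http://example.org/A' => 3 }
end
```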
During term indexing, index_doc called retrieve_hierarchy_ids per class,
issuing iterative SPARQL queries level-by-level to collect ancestors.
For large ontologies (100K+ classes), this produced hundreds of thousands
of SPARQL round-trips.

Replace with a single paginated SPARQL query to fetch all parent-child
edges, then compute the transitive closure in memory using memoized BFS.
The precomputed ancestor map is stored as a class-level cache on
LinkedData::Models::Class for the duration of bulk indexing and cleared
in an ensure block afterward.
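The in-memory closure step might look like the sketch below, assuming the child-to-parents edge map has already been fetched by the single paginated SPARQL query; the class name is illustrative, not the actual cache on LinkedData::Models::Class:

```ruby
require 'set'

# Sketch: given a child => [parents] edge map, compute each class's
# full ancestor set with memoization. Handles shared ancestors
# (diamond inheritance) and guards against cycles.
class AncestorCache
  def initialize(parents_by_child)
    @parents = parents_by_child
    @memo = {}
  end

  def ancestors(id, visiting = {})
    return @memo[id] if @memo.key?(id)
    return Set.new if visiting[id] # cycle guard: stop re-entrant walks

    visiting[id] = true
    result = Set.new
    (@parents[id] || []).each do |parent|
      result << parent
      result.merge(ancestors(parent, visiting))
    end
    visiting.delete(id)
    @memo[id] = result
  end
end
```

Memoization makes each edge contribute to the closure once, so bulk indexing does one SPARQL fetch plus in-memory lookups instead of per-class traversals.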
Add test_ancestors_precompute.rb covering linear chain, diamond
inheritance, multiple roots, cycles, complex DAG, memoization, and
edge cases. All tests are pure in-memory, no triplestore required.

Add temporary per-class validation in the indexing loop that compares
precomputed ancestors against the old retrieve_hierarchy_ids SPARQL
traversal for every class. Logs warnings on mismatches. To be removed
once validated against production data.
Per-class ancestor validation is expensive (runs both old and new
for every class). Only enable it when explicitly requested via
OP_VALIDATE_ANCESTORS=1 so it does not slow down normal indexing.
When OP_VALIDATE_ANCESTORS=1 is set, log old vs new timing for each
class and whether ancestors matched or mismatched. Useful for comparing
SPARQL traversal cost against in-memory cache lookup.
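The opt-in gate could be sketched as below; `index_class` and `slow_sparql_ancestors` are hypothetical stand-ins for the indexing loop and the old retrieve_hierarchy_ids traversal:

```ruby
# Sketch: the expensive old-vs-new comparison only runs when the
# OP_VALIDATE_ANCESTORS=1 environment variable is set.
def validate_ancestors?
  ENV['OP_VALIDATE_ANCESTORS'] == '1'
end

def index_class(cls_id, ancestor_cache)
  cached = ancestor_cache.fetch(cls_id, [])
  if validate_ancestors?
    sparql = slow_sparql_ancestors(cls_id) # old per-class traversal
    warn "ancestor mismatch for #{cls_id}" unless sparql.sort == cached.sort
  end
  cached
end

# Stand-in for the old level-by-level SPARQL walk.
def slow_sparql_ancestors(_cls_id)
  []
end
```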
libxml-ruby v6 removed lib/xml.rb which provided `require 'xml'` and
mixed LibXML into the global namespace. Switch to `require 'libxml'`
and use the fully qualified LibXML::XML:: namespace.
Per PR review feedback, maxDepth should reflect the actual hierarchy
depth regardless of the flat flag. The flat flag is a UI/browsing
concern, not a statement about ontology structure.
Clean up metrics: dead code, flat maxDepth fix, remove CSV fallback
Fix missing label retry state leaking across ontology processing
Replace @old_ancestors_result/@new_ancestors_result instance variables
with local variables. Rename to sparql_ancestors/cached_ancestors for
clarity on what each represents.

Addresses mdorf's review feedback on PR #279.
The test was pinning broken behavior (GH-274) where SKOS submissions
without skos:Concept wrongly entered the retry path. PR #277 removed
the CSV class_count fallback, which indirectly fixed the trigger — the
SPARQL-reported class count now agrees with the actual empty result,
so total_pages stays 0 and the loop exits cleanly.

Rename the test and assert the correct outcome: processing completes,
RDF_LABELS is set, ERROR_RDF_LABELS is not, and requested_lang is
cleared. Closes GH-274.
Precompute ancestor hierarchy to speed up term indexing
  libxml-ruby v6 ships only lib/libxml-ruby.rb as a top-level
  loadable file. `require 'libxml'` raises LoadError, which broke
  `require 'ontologies_linked_data'` at load time.

  Follow-up to 71cb17a, and also fixes pre-existing breakage in
  parse_diff_file.rb that had the same require.
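The actual fix in these commits is a direct require change, but the version-dependent entry points could also be handled with a generic fallback helper like this sketch (`require_first` is hypothetical, not part of the codebase):

```ruby
# Sketch: try each candidate entry point in order until one loads,
# since different libxml-ruby versions ship different top-level files.
def require_first(*names)
  names.each do |name|
    begin
      require name
      return name
    rescue LoadError
      next
    end
  end
  raise LoadError, "could not load any of: #{names.join(', ')}"
end
```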
mdorf merged commit bcd3dd1 into master Apr 21, 2026
10 checks passed