NUTCH-1807 Avoid methods relying on system-specific default locale / charset #924

Open
sebastian-nagel wants to merge 12 commits into apache:master from sebastian-nagel:NUTCH-1807-forbidden-apis

Conversation

@sebastian-nagel sebastian-nagel commented Jun 9, 2026

Contributor

Integrate the Forbidden API checker into the Nutch build. It uses the Ivy cachepath Ant task to resolve the Forbidden APIs jar, cf. Forbidden API Ant Usage.

  • Fix forbidden calls in Nutch core classes.
  • For the indexer, field "binaryContent": try to decode binary content using the charset from parse metadata, cf. NUTCH-2773.
  • Code cleanup in updated classes:
    • remove trailing whitespace
    • sort and group imports
    • remove unnecessary casts
    • add missing override annotations
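
For illustration, a minimal Java sketch of the kind of change such fixes typically involve: replacing calls that silently use the JVM's default locale or charset with overloads that name them explicitly. The class and method names (`LocaleSafeCalls`, `normalize`, `encode`, `decode`) are hypothetical and not taken from the Nutch code base; the choice of `Locale.ROOT` and UTF-8 is an assumption for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical example class, not part of Nutch.
public class LocaleSafeCalls {

  /** Lowercases independently of the JVM's default locale. */
  public static String normalize(String s) {
    // s.toLowerCase() without an argument would use Locale.getDefault()
    return s.toLowerCase(Locale.ROOT);
  }

  /** Encodes with an explicit charset instead of the platform default. */
  public static byte[] encode(String s) {
    // s.getBytes() without an argument would use the default charset
    return s.getBytes(StandardCharsets.UTF_8);
  }

  /** Decodes with an explicit charset instead of the platform default. */
  public static String decode(byte[] bytes) {
    // new String(bytes) would use the default charset
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // Under Turkish casing rules "I" lowercases to dotless "ı" (U+0131),
    // so the result depends on the locale in effect.
    String turkish = "TITLE".toLowerCase(Locale.forLanguageTag("tr"));
    System.out.println(turkish.equals("title"));            // false
    System.out.println(normalize("TITLE").equals("title")); // true
    System.out.println(decode(encode("caf\u00e9")).equals("caf\u00e9")); // true
  }
}
```

The Turkish case shows why a default-locale call can behave differently from one machine to another even though the code is unchanged.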

TODO:

  • Check plugins.
  • Fix forbidden API calls in plugins.
  • Fix the error "failed to create task or type antlib:org.apache.ivy.ant:cachepath": the task definition needs to refer to the Ivy jar downloaded into the ivy/ folder.
  • Add Apache license headers (?)
    • src/test/host-protocol-mapping.txt (test file)
      Note: Yetus failure is caused by required tabs in this file.
    • src/testresources/forbidden-apis-signatures.txt
  • Integrate the Ant target "forbidden-api-checks" into automated builds, either as part (dependency) of the target "test", or by calling it directly.
    • The forbidden API check runs on compiled classes, so the sources and plugins must be built first, including the unit tests if they are to be checked.

@sebastian-nagel sebastian-nagel marked this pull request as draft June 9, 2026 14:56
@sebastian-nagel sebastian-nagel changed the title Nutch 1807 forbidden apis NUTCH-1807 Avoid methods relying on system-specific default locale / charset Jun 9, 2026
@lewismc

lewismc commented Jun 10, 2026

Member

Excellent initiative, thanks!
👀

@sebastian-nagel sebastian-nagel marked this pull request as ready for review June 11, 2026 16:37
@sebastian-nagel

Contributor Author

Code cleaned up.

The forbidden API checks can be run with

ant forbidden-api-checks

They are not yet integrated into automated workflow runs.

@sebastian-nagel sebastian-nagel requested a review from lewismc June 11, 2026 16:40

@uschindler uschindler left a comment


👍

@sebastian-nagel sebastian-nagel force-pushed the NUTCH-1807-forbidden-apis branch from aae94cc to 8a40ec5 Compare June 14, 2026 17:59
@sebastian-nagel

Contributor Author
  • Resolved merge conflicts
  • Added missing license headers
  • Integrated into CI workflows: the Forbidden API Checks are only run when tests are run.

…charset

Integrate Forbidden API checker into build.
…charset

Fix forbidden calls in Nutch core classes.

Indexer, field "binaryContent": try to decode binary content using the
charset from parse metadata, cf. NUTCH-2773.
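
A rough sketch of what decoding binary content with a metadata-supplied charset could look like. This is not the indexer code from this pull request: the class `BinaryContentDecoder`, its `decode` method, and the fallback to UTF-8 for a missing or unknown charset name are all assumptions made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical example class, not part of Nutch.
public class BinaryContentDecoder {

  /**
   * Decodes the bytes using the given charset name (e.g. one recorded in
   * parse metadata). Falls back to UTF-8 (an assumption of this sketch)
   * if the name is null, syntactically illegal, or unsupported by the JVM.
   */
  public static String decode(byte[] content, String charsetName) {
    Charset charset = StandardCharsets.UTF_8;
    if (charsetName != null) {
      try {
        charset = Charset.forName(charsetName.trim());
      } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
        // keep the UTF-8 fallback
      }
    }
    // Explicit charset: never relies on the platform default
    return new String(content, charset);
  }

  public static void main(String[] args) {
    byte[] latin1 = { (byte) 0xE9 }; // "é" encoded in ISO-8859-1
    System.out.println(decode(latin1, "ISO-8859-1").equals("\u00e9")); // true
  }
}
```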

Code cleanup in updated classes:
- remove trailing whitespace
- sort and group imports
…charset

Move cachepath / taskdef task definitions into target "forbidden-api-checks"
to use the ivy lib installed in ivy/
…charset

Link to NUTCH-3181 (URL construction deprecation).
The urlfilter-domain unit tests use:
src/plugin/urlfilter-domain/data/hosts.txt
…charset

Complete classpath for forbidden API checks: add plugin jars
…charset

- Remove unnecessary string lowercasing and trim in
  ProtocolFactory.getProtocol(url).
- For efficiency, perform host/domain lookups only if protocol mappings are
  configured (by default there are none).
- Add unit tests for per-host and per-domain protocol mappings.
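
The lookup optimization described above can be sketched as follows. This is an illustrative sketch only, not the `ProtocolFactory` implementation: the class `ProtocolMappingLookup`, its methods, and the `naiveDomain` helper (which simply keeps the last two host labels rather than consulting a public suffix list) are hypothetical, and the plugin IDs are used as example values.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical example class, not part of Nutch.
public class ProtocolMappingLookup {

  private final Map<String, String> hostMappings = new HashMap<>();
  private final Map<String, String> domainMappings = new HashMap<>();

  public void addHostMapping(String host, String pluginId) {
    hostMappings.put(host, pluginId);
  }

  public void addDomainMapping(String domain, String pluginId) {
    domainMappings.put(domain, pluginId);
  }

  /** Returns the mapped plugin ID for a host, if any mapping applies. */
  public Optional<String> lookup(String host) {
    // Fast path: with no mappings configured, skip all lookups
    if (hostMappings.isEmpty() && domainMappings.isEmpty()) {
      return Optional.empty();
    }
    // A per-host mapping takes precedence over a per-domain mapping
    String pluginId = hostMappings.get(host);
    if (pluginId == null) {
      pluginId = domainMappings.get(naiveDomain(host));
    }
    return Optional.ofNullable(pluginId);
  }

  /** Simplification for this sketch: keep only the last two host labels. */
  static String naiveDomain(String host) {
    int last = host.lastIndexOf('.');
    if (last <= 0) {
      return host;
    }
    int prev = host.lastIndexOf('.', last - 1);
    return prev < 0 ? host : host.substring(prev + 1);
  }

  public static void main(String[] args) {
    ProtocolMappingLookup lookup = new ProtocolMappingLookup();
    System.out.println(lookup.lookup("www.example.org").isPresent()); // false
    lookup.addDomainMapping("example.org", "protocol-okhttp");
    System.out.println(lookup.lookup("www.example.org").get()); // protocol-okhttp
  }
}
```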
…charset

Enable forbidden API checks for plugins, including plugin unit tests.
Allow reflection usage, some unit tests require it.
…charset

Fix forbidden calls in Nutch core classes, unit tests and plugins.

Code cleanup in updated classes:
- remove trailing whitespace
- sort and group imports
- remove unnecessary casts
- add missing override annotations
…charset

Add license headers to configuration files.

Allow tabs in configuration files.
…charset

Change target forbidden-api-checks to only verify Nutch core and plugin
classes, not verifying test classes. Test classes are only checked
if tests are run.

Integrate into CI workflows:
Run the Forbidden API Checks when any tests are run,
that is, only when there are changes in Nutch core or plugin classes.
@sebastian-nagel sebastian-nagel force-pushed the NUTCH-1807-forbidden-apis branch from 8a40ec5 to 4319278 Compare June 19, 2026 12:19
- remove tabs in Java code (added lines)
- remove @author annotations in change context
@sebastian-nagel sebastian-nagel force-pushed the NUTCH-1807-forbidden-apis branch from b6e9b1b to b845812 Compare June 19, 2026 14:06

3 participants