Fix missing space between block and inline sibling elements in JSoupTextExtractor #1926

Merged
jnioche merged 2 commits into apache:main from nicoscandolo:fix/text-extractor-block-inline-spacing on Jun 3, 2026

@nicoscandolo
Contributor

Fixes #1925

Problem

When a block-level element (e.g. <h3>) is immediately followed by an inline element (e.g. <a>, <span>), the extractor concatenates their text without a separating space. For instance, this HTML from a real page (https://www.acai-island.com/contact):

<h3>Email</h3>
<a href="mailto:acaiisland1300@gmail.com">acaiisland1300@gmail.com</a>

Produces Emailacaiisland1300@gmail.com instead of Email acaiisland1300@gmail.com.

Root cause

In tail(), the space after a block element is only appended when nextSibling() instanceof TextNode. This misses cases where the next sibling is an inline Element.

Fix

Changed the condition to nextSibling() != null, so a space is appended after any block element regardless of the sibling type. The existing lastCharIsWhitespace guard prevents duplicate spaces.
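
As a rough illustration of the change described above, here is a self-contained sketch that models the tail-step condition on a tiny hand-built node tree. It does not use jsoup and is not the code from this PR: `BlockSpacingSketch`, its `Node` class, the `BLOCK_TAGS` set, and the `anySibling` flag are hypothetical names invented for the example. Only the two conditions (next sibling is a text node versus next sibling is non-null) and the `lastCharIsWhitespace` guard follow the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the traversal described above; this is neither the
// jsoup API nor the actual JSoupTextExtractor code.
final class BlockSpacingSketch {

    /** Minimal DOM-like node: a text node (tag == null) or an element. */
    static final class Node {
        final String tag;
        final String text;
        final List<Node> children = new ArrayList<>();
        Node parent;

        private Node(String tag, String text) {
            this.tag = tag;
            this.text = text;
        }

        static Node text(String text) {
            return new Node(null, text);
        }

        static Node element(String tag, Node... kids) {
            Node n = new Node(tag, null);
            for (Node k : kids) {
                k.parent = n;
                n.children.add(k);
            }
            return n;
        }

        boolean isText() {
            return tag == null;
        }

        Node nextSibling() {
            if (parent == null) {
                return null;
            }
            int i = parent.children.indexOf(this);
            return i + 1 < parent.children.size() ? parent.children.get(i + 1) : null;
        }
    }

    // Assumed set of block-level tags, kept tiny for the example.
    static final Set<String> BLOCK_TAGS = Set.of("h3", "div", "p");

    /** Collects text depth-first; anySibling selects the condition after the fix. */
    static String extract(Node root, boolean anySibling) {
        StringBuilder sb = new StringBuilder();
        visit(root, sb, anySibling);
        return sb.toString().trim();
    }

    private static void visit(Node node, StringBuilder sb, boolean anySibling) {
        if (node.isText()) {
            sb.append(node.text);
            return;
        }
        for (Node child : node.children) {
            visit(child, sb, anySibling);
        }
        // "Tail" step: runs once all children of the element have been visited.
        if (BLOCK_TAGS.contains(node.tag)) {
            Node next = node.nextSibling();
            boolean wantsSpace = anySibling
                    ? next != null                    // condition after the fix
                    : next != null && next.isText();  // condition before the fix
            if (wantsSpace && !lastCharIsWhitespace(sb)) {
                sb.append(' ');
            }
        }
    }

    // Guard that keeps the wider condition from producing double spaces.
    static boolean lastCharIsWhitespace(StringBuilder sb) {
        return sb.length() > 0 && Character.isWhitespace(sb.charAt(sb.length() - 1));
    }

    public static void main(String[] args) {
        // The tree is built by hand, so <h3> and <a> are direct siblings.
        Node page = Node.element("body",
                Node.element("h3", Node.text("Email")),
                Node.element("a", Node.text("info@example.com")));
        System.out.println(extract(page, false)); // Emailinfo@example.com
        System.out.println(extract(page, true));  // Email info@example.com
    }
}
```

With the narrower condition the heading and link text run together; with the wider one a single space separates them, and a block whose text already ends in whitespace gets no extra space because of the guard.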

Tests

Added two test cases covering block+inline combinations (<h3>+<a> and <div>+<span>).

@rzo1 added this to the 3.6.1 milestone on Jun 2, 2026
@rzo1 added the bug and java labels on Jun 2, 2026
Contributor

@rzo1 left a comment

Thanks for the fix. Overall, it looks good to me.

The only thing I would like to change is that the tests and comments embed a real small business's website and email address (acaiisland1300@gmail.com, acai-island.com). For an Apache codebase that lives forever, I'd suggest anonymizing these to info@example.com and generic text. The issue reference (#1925) already preserves the real-world provenance.

In addition, the existing comment above the condition still reads "a space between block tags and immediately following text nodes"; it could be updated to "following siblings" to match the new behavior.

@rzo1 requested review from dpol1 and jnioche on June 2, 2026 at 17:51
Contributor

@dpol1 left a comment

LGTM. lastCharIsWhitespace already guards against double spaces, so widening the condition to != null is safe. Comment and tests look good.

Contributor

@jnioche left a comment

Thanks, @nicoscandolo and our reviewers.

@jnioche merged commit 46bf3f2 into apache:main on Jun 3, 2026
2 checks passed

Labels

bug, java (Pull requests that update Java code)

Development

Successfully merging this pull request may close these issues.

JSoupTextExtractor: missing space between block and inline sibling elements

4 participants