Fix missing space between block and inline sibling elements in JSoupTextExtractor#1926
Conversation
…extExtractor Fixes apache#1925
rzo1
left a comment
There was a problem hiding this comment.
Thanks for the fix. Overall, it looks good to me.
The only thing I would like to change is, that tests/comments embed a real small business's website and email address (acaiisland1300@gmail.com, acai-island.com). For an Apache codebase that lives forever, I'd suggest anonymizing to info@example.com / generic text. The issue reference (#1925) already preserves the real-world provenance.
In addition, the existing comment above the condition still reads "a space between block tags and immediately following text nodes" - it could be updated to "following siblings" to match the new behavior.
dpol1
left a comment
There was a problem hiding this comment.
LGTM. lastCharIsWhitespace already guards against double spaces so widening to != null is safe. comment + tests good.
jnioche
left a comment
There was a problem hiding this comment.
thanks @nicoscandolo and our reviewers
Fixes #1925
Problem
When a block-level element (e.g.
<h3>) is immediately followed by an inline element (e.g.<a>,<span>), the extracted text concatenates both without any space. For instance, this HTML from a real page (https://www.acai-island.com/contact):Produces
Emailacaiisland1300@gmail.cominstead ofEmail acaiisland1300@gmail.com.Root cause
In
tail(), the space after a block element is only appended whennextSibling() instanceof TextNode. This misses cases where the next sibling is an inline Element.Fix
Changed the condition to
nextSibling() != null, so a space is appended after any block element regardless of the sibling type. The existinglastCharIsWhitespaceguard prevents duplicate spaces.Tests
Added two test cases covering block+inline combinations (
<h3>+<a>and<div>+<span>).