Details
Bug
Status: Closed
Normal
Resolution: Fixed
Flagged
Description
Calls to String.toUpperCase() are sensitive to typographic ligatures [1]: the uppercased string can be longer than the original, so String.length() and String.toUpperCase().length() may differ. For example, the uppercase of the single character ß is the two characters SS.
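A minimal sketch of the length difference. The class and method names are made up for illustration, and Locale.ROOT is used only to keep the output independent of the default locale:

```java
import java.util.Locale;

final class UpperCaseLengthDemo {
    // Number of UTF-16 chars that uppercasing adds to s.
    static int growth(String s) {
        return s.toUpperCase(Locale.ROOT).length() - s.length();
    }

    public static void main(String[] args) {
        // \u00DF is the sharp s, \uFB00 is the "ff" ligature.
        System.out.println("\u00DF".toUpperCase(Locale.ROOT));
        System.out.println("growth(sharp s)     = " + growth("\u00DF"));
        System.out.println("growth(ff ligature) = " + growth("\uFB00"));
        System.out.println("growth(abc)         = " + growth("abc"));
    }
}
```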
SimpleHtmlExtractor#getInnerHtmlSimply calls line.substring(0, offset), where offset comes from line.toUpperCase().indexOf(endTag). That offset is an index into the uppercased copy, not into line itself, so applying it to line is incorrect and can lead to a StringIndexOutOfBoundsException.
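One possible way to avoid the mismatch is to search the original line case-insensitively, so the returned index is always valid for line.substring. This is only a sketch and not necessarily the fix that was applied; indexOfIgnoreCase is a hypothetical helper, not an existing method of SimpleHtmlExtractor:

```java
final class CaseInsensitiveSearch {
    // Hypothetical helper: index in `line` of the first case-insensitive
    // occurrence of `tag`, or -1 if there is none. The index refers to the
    // original string, never to an uppercased copy.
    static int indexOfIgnoreCase(String line, String tag) {
        int last = line.length() - tag.length();
        for (int i = 0; i <= last; i++) {
            if (line.regionMatches(true, i, tag, 0, tag.length())) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String line = "<p>\u00DF</p></body>";
        int offset = indexOfIgnoreCase(line, "</BODY>");
        // Safe: offset is an index into line itself.
        System.out.println(line.substring(0, offset));
    }
}
```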
Reproduction: use the following as the hippostd:content property of an HTML field and have it rendered by the <hst:html /> tag.
<html><body>
<p>8 ligature characters: ﬀ ß ﬀ ﬂ ß ﬀ ﬂ ß</p></body>
</html>
This results in a StringIndexOutOfBoundsException, because the index of </BODY> in the uppercased second line is applied to the original second line, which is shorter than its uppercased form.
This is a deep corner case, since <html><body> is no longer used except for old content.
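The arithmetic can be checked in isolation. The sketch below assumes the second line of the sample ends at </body> and contains eight characters that each uppercase to two (the ß, ff and fl ligature characters). The class and method names are illustrative, and Locale.ROOT stands in for the default locale used by the original call:

```java
import java.util.Locale;

final class OffsetMismatchDemo {
    // Assumed second line of the sample content:
    // \uFB00 = ff ligature, \uFB02 = fl ligature, \u00DF = sharp s.
    static final String LINE =
        "<p>8 ligature characters: \uFB00 \u00DF \uFB00 \uFB02 \u00DF \uFB00 \uFB02 \u00DF</p></body>";

    // Mimics the described pattern: the offset is taken from the uppercased copy.
    static int offsetInUpperCased(String line, String endTag) {
        return line.toUpperCase(Locale.ROOT).indexOf(endTag);
    }

    public static void main(String[] args) {
        int offset = offsetInUpperCased(LINE, "</BODY>");
        System.out.println("line length = " + LINE.length());
        System.out.println("offset      = " + offset);
        try {
            // The offset points past the end of the original line.
            LINE.substring(0, offset);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("substring failed: " + e.getMessage());
        }
    }
}
```

With eight expanding characters the offset moves 8 positions to the right, while </body> itself is only 7 characters long, so the offset ends up beyond the end of the original line.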