Affects Version/s: None
Fix Version/s: 3.1.3
Processed by team:Pulsar
Sprint:Platform sprint 126, Platform sprint 127
This is a regression in repo 7.9 with respect to 7.8. The fix also needs to be forward ported to 10 and 11.
Problem is as follows:
In 7.8, after the Tika parser has been created in JackrabbitParser (with tika-config.xml), org.apache.jackrabbit.core.query.lucene.JackrabbitParser#setTextFilterClasses is invoked (in general, our repository.xml had those text filter classes configured). As a result, only about 10 parsers (pdf, html, text, doc, etc.) end up in the Tika parser.
In 7.9, after the Tika parser has been created in SearchIndex, setTextFilterClasses is no longer invoked, leaving the Tika parser configured with about 120 parsers. The Tika parser configuration (tika-config.xml) in 7.9 (and also in 7.8, where it was overridden by setTextFilterClasses) is as follows:
The problem with this is that org.apache.jackrabbit.core.query.lucene.NodeIndexer#isSupportedMediaType now returns true for, for example, a png binary, instead of false as in 7.8. As a result, the png stream is fetched from the database and then parsed as empty via the EmptyParser.
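To illustrate the effect, here is a minimal plain-Java sketch. It is not the Jackrabbit or Tika code: SketchParser, broadConfig and narrowConfig are hypothetical names, and the map lookup only approximates what the real check does. The assumption is simply that a media type counts as supported as soon as any parser is registered for it, even one that extracts nothing.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class MediaTypeSupportSketch {

    /** Simplified stand-in for a parser; NOT the Tika Parser interface. */
    interface SketchParser {
        String extractText(byte[] data);
    }

    /** Plays the role of an empty parser: accepts input, extracts nothing. */
    static final SketchParser EMPTY = data -> "";

    /** Hypothetical plain-text extractor for the sketch. */
    static final SketchParser PLAIN_TEXT = data -> new String(data, StandardCharsets.UTF_8);

    /** Assumed behaviour: a type is "supported" if any parser is registered for it. */
    static boolean isSupportedMediaType(Map<String, SketchParser> parsers, String mediaType) {
        return parsers.containsKey(mediaType);
    }

    /** Models the 7.9 situation: images are mapped to the empty parser. */
    static Map<String, SketchParser> broadConfig() {
        Map<String, SketchParser> parsers = new HashMap<>();
        parsers.put("text/plain", PLAIN_TEXT);
        parsers.put("image/png", EMPTY);
        return parsers;
    }

    /** Models the 7.8 situation: only real text extractors are registered. */
    static Map<String, SketchParser> narrowConfig() {
        Map<String, SketchParser> parsers = new HashMap<>();
        parsers.put("text/plain", PLAIN_TEXT);
        return parsers;
    }

    public static void main(String[] args) {
        System.out.println(isSupportedMediaType(broadConfig(), "image/png"));  // true
        System.out.println(isSupportedMediaType(narrowConfig(), "image/png")); // false
    }
}
```

With the broad configuration the caller has no reason to skip the binary, so it would load the stream only to get an empty string back.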
The solution is fairly simple AFAICS. The tika-config.xml should look as follows:
I still need to test it, but almost certainly, with the above config there won't be a parser for 'image/jpeg', and thus isSupportedMediaType will return false (instead of true with the EmptyParser).
The downside is that it is not easy to override the tika-config.xml: this is only possible by specifying a file path, but I want to override it from within a jar file. Also, the org.apache.jackrabbit.core.query.lucene.SearchIndex#createParser method is private, hence we cannot override it. We should rather not do that anyway, because SearchIndex uses the private 'parser' instance member.
It would be nice if in SearchIndex
This way, you can override the tika-config.xml from a class that extends SearchIndex like we do.
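As an illustration of the kind of extension point meant here, the following is a rough plain-Java sketch, not the actual SearchIndex code. BaseIndex, CustomIndex, openTikaConfig and loadConfig are hypothetical names, and the config contents are placeholders. The idea is that the base class keeps its parser private but obtains the configuration stream through a protected method that a subclass may override.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class OverridableConfigSketch {

    /** Stand-in for SearchIndex; every name in this class is hypothetical. */
    static class BaseIndex {

        /** Hook: subclasses may override this to supply their own config. */
        protected InputStream openTikaConfig() {
            return stream("<properties><!-- default config --></properties>");
        }

        /** Reads whatever the hook returns; a real index would build its parser from it. */
        final String loadConfig() {
            try (InputStream in = openTikaConfig()) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[1024];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Stand-in for a project-specific subclass of the index. */
    static class CustomIndex extends BaseIndex {
        @Override
        protected InputStream openTikaConfig() {
            // A real subclass could instead return
            // getClass().getResourceAsStream(...) to read a config from its own jar.
            return stream("<properties><!-- custom config --></properties>");
        }
    }

    /** Helper that turns a string into an in-memory stream. */
    private static InputStream stream(String content) {
        return new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(new BaseIndex().loadConfig());
        System.out.println(new CustomIndex().loadConfig());
    }
}
```

Keeping loadConfig final while exposing only the stream hook means the subclass never needs access to the private parser itself.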
Alternatively, we could just sanitize the tika-config.xml in Jackrabbit. That might be enough.