Details
- Bug
- Status: Closed
- Top
- Resolution: Fixed
- Platform sprint 126, Platform sprint 127
Description
This is a regression in repo 7.9 wrt 7.8. Also needs to be forward ported to 10 and 11.
Problem is as follows:
In 7.8, after the Tika parser has been created in JackrabbitParser (with tika-config.xml), org.apache.jackrabbit.core.query.lucene.JackrabbitParser#setTextFilterClasses is invoked (in general, our repository.xml had those text filter classes configured). As a result, only about 10 parsers (pdf, html, text, doc, etc.) end up in the Tika parser.
In 7.9, after the Tika parser has been created in SearchIndex, setTextFilterClasses is no longer invoked, leaving the Tika parser configured with about 120 parsers. The Tika parser configuration (tika-config.xml) in 7.9 (also present in 7.8, but overridden there by setTextFilterClasses) is as follows:
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.jackrabbit.core.query.pdf.PDFParser">
      <!-- JCR-2838: Override the faulty PDF parser in Tika 0.8 -->
      <mime>application/pdf</mime>
    </parser>
    <parser class="org.apache.tika.parser.EmptyParser">
      <!-- Disable package extraction as it's too resource-intensive -->
      <mime>application/x-archive</mime>
      <mime>application/x-bzip</mime>
      <mime>application/x-bzip2</mime>
      <mime>application/x-cpio</mime>
      <mime>application/x-gtar</mime>
      <mime>application/x-gzip</mime>
      <mime>application/x-tar</mime>
      <mime>application/zip</mime>
      <!-- Disable image extraction as there's no text to be found -->
      <mime>image/bmp</mime>
      <mime>image/gif</mime>
      <mime>image/jpeg</mime>
      <mime>image/png</mime>
      <mime>image/vnd.wap.wbmp</mime>
      <mime>image/x-icon</mime>
      <mime>image/x-psd</mime>
      <mime>image/x-xcf</mime>
    </parser>
  </parsers>
</properties>
The problem with this is that org.apache.jackrabbit.core.query.lucene.NodeIndexer#isSupportedMediaType now returns true for, e.g., image/png, whereas it returned false in 7.8. As a result, the png stream is fetched from the database and then parsed to empty text by the EmptyParser.
The solution is fairly simple AFAICS. The tika-config.xml should look as follows:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <!-- and all the other 'empty parser' configs above -->
      <!-- also add exclusion of ExecutableParser by default -->
      <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
    </parser>
  </parsers>
</properties>
I still need to test it, but almost certainly, with the above config there won't be a parser for 'image/jpeg', and thus isSupportedMediaType will return false (instead of true with EmptyParser).
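The difference between the two configurations can be illustrated with a small stand-alone model. This is only a sketch, not Tika or Jackrabbit code: ParserRegistryModel and all of its methods are hypothetical names, and it assumes that isSupportedMediaType boils down to "some parser is registered for this media type".

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified stand-in for a parser registry; not Tika's real classes. */
class ParserRegistryModel {
    // media type -> name of the parser that claims it
    private final Map<String, String> parserByType = new HashMap<>();

    /** Registers a parser for media types; a later registration overrides an earlier one. */
    ParserRegistryModel register(String parserName, String... mediaTypes) {
        for (String type : mediaTypes) {
            parserByType.put(type, parserName);
        }
        return this;
    }

    /** Removes media types entirely, modelling a mime-exclude entry. */
    ParserRegistryModel exclude(String... mediaTypes) {
        for (String type : mediaTypes) {
            parserByType.remove(type);
        }
        return this;
    }

    /** Assumed check: a type counts as supported as soon as any parser claims it. */
    boolean isSupportedMediaType(String mediaType) {
        return parserByType.containsKey(mediaType);
    }

    /** A tiny subset of what a default parser might claim (illustrative only). */
    static ParserRegistryModel defaults() {
        return new ParserRegistryModel()
                .register("DefaultParser", "application/pdf", "text/html", "image/png", "image/jpeg");
    }

    /** Models the 7.9 config: images are mapped to an EmptyParser, so they stay "supported". */
    static ParserRegistryModel withEmptyParser() {
        return defaults().register("EmptyParser", "image/png", "image/jpeg");
    }

    /** Models the proposed config: images are excluded, so no parser claims them. */
    static ParserRegistryModel withExclusions() {
        return defaults().exclude("image/png", "image/jpeg");
    }
}
```

In this model, withEmptyParser() still reports image/png as supported (so the binary would be fetched and then parsed to nothing), while withExclusions() does not, which is the behaviour the proposed tika-config.xml aims for.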
The downside is that it is not easy to override the tika-config.xml (only by specifying a file path, but I want to override it from within a jar file). Also, the org.apache.jackrabbit.core.query.lucene.SearchIndex#createParser method is private, hence we cannot override it. We should rather not do that anyway, because SearchIndex uses the private 'parser' instance member.
It would be nice if in SearchIndex
else {
    ClassLoader loader = SearchIndex.class.getClassLoader();
    url = loader.getResource(tikaConfigPath);
}
would be
else {
    // try the class loader of the concrete (sub)class first
    ClassLoader loader = getClass().getClassLoader();
    url = loader.getResource(tikaConfigPath);
    if (url == null) {
        // fall back to the class loader of SearchIndex itself
        loader = SearchIndex.class.getClassLoader();
        url = loader.getResource(tikaConfigPath);
    }
}
This way, you can override the tika-config.xml from a class that extends SearchIndex like we do.
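The lookup order proposed here can be sketched independently of Jackrabbit. ConfigResourceLocator and locate are hypothetical names used only for illustration; the sketch assumes the subclass is passed as the preferred class and SearchIndex would be the fallback.

```java
import java.net.URL;

/** Hypothetical helper illustrating a "subclass first, base class second" resource lookup. */
class ConfigResourceLocator {

    /**
     * Looks up path via the class loader of preferred first and, if nothing
     * is found there, via the class loader of fallback. Returns null if
     * neither loader can see the resource.
     */
    static URL locate(String path, Class<?> preferred, Class<?> fallback) {
        URL url = getResource(preferred.getClassLoader(), path);
        if (url == null) {
            url = getResource(fallback.getClassLoader(), path);
        }
        return url;
    }

    private static URL getResource(ClassLoader loader, String path) {
        // Class#getClassLoader() may return null for bootstrap classes;
        // use the system lookup in that case instead of failing.
        return loader != null ? loader.getResource(path) : ClassLoader.getSystemResource(path);
    }
}
```

A subclass would call something like locate(tikaConfigPath, getClass(), SearchIndex.class), so a tika-config.xml shipped in the subclass's jar would win over the one bundled with Jackrabbit whenever the two are loaded by different class loaders.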
Alternatively, we could just sanitize the tika-config.xml in Jackrabbit. That might be enough.
Issue Links
is cloned by:
- REPO-1458 [Forward port 3.2.2] Indexing performance degradation of images (and other binaries) (Closed)
- REPO-1459 [Forward port Master] Indexing performance degradation of images (and other binaries) (Closed)
- REPO-1457 [Forward port 3.1.3] Indexing performance degradation of images (and other binaries) (Closed)