[REPO-1458] [Forward port 3.2.2] Indexing performance degradation of images (and other binaries)

Details

Type: Bug
Status: Closed
Priority: Top
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.2.2
Component/s: None
Labels: None
Sprint: Platform sprint 126, Platform sprint 127

Description

This is a regression in repo 7.9 with respect to 7.8. The fix also needs to be forward ported to 10 and 11.

The problem is as follows:

In 7.8, after the Tika parser has been created in JackrabbitParser (with tika-config.xml), org.apache.jackrabbit.core.query.lucene.JackrabbitParser#setTextFilterClasses is invoked (in general, our repository.xml had those text filter classes configured). As a result, only about 10 parsers (pdf, html, text, doc, etc.) would end up in the Tika parser.

In 7.9, after the Tika parser has been created in SearchIndex, setTextFilterClasses is no longer invoked, leaving the Tika parser configured with about 120 parsers. The Tika parser configuration (tika-config.xml) in 7.9 (and also in 7.8, where it was overridden by setTextFilterClasses) is as follows:

<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>

    <parser class="org.apache.jackrabbit.core.query.pdf.PDFParser">
      <!-- JCR-2838: Override the faulty PDF parser in Tika 0.8 -->
      <mime>application/pdf</mime>
    </parser>

    <parser class="org.apache.tika.parser.EmptyParser">
      <!-- Disable package extraction as it's too resource-intensive -->
      <mime>application/x-archive</mime>
      <mime>application/x-bzip</mime>
      <mime>application/x-bzip2</mime>
      <mime>application/x-cpio</mime>
      <mime>application/x-gtar</mime>
      <mime>application/x-gzip</mime>
      <mime>application/x-tar</mime>
      <mime>application/zip</mime>
      <!-- Disable image extraction as there's no text to be found -->
      <mime>image/bmp</mime>
      <mime>image/gif</mime>
      <mime>image/jpeg</mime>
      <mime>image/png</mime>
      <mime>image/vnd.wap.wbmp</mime>
      <mime>image/x-icon</mime>
      <mime>image/x-psd</mime>
      <mime>image/x-xcf</mime>
    </parser>

  </parsers>

</properties>

The problem with this is that org.apache.jackrabbit.core.query.lucene.NodeIndexer#isSupportedMediaType now returns true for, for example, image/png, whereas it returned false in 7.8. As a result, the png stream is fetched from the database and then parsed as empty via the EmptyParser.
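To make the difference concrete, here is a toy model of the behaviour described above. It is not Jackrabbit or Tika code: ParserRegistryModel and its methods are hypothetical names, and it assumes the media type check boils down to "is any parser registered for this type". Under that assumption, mapping image/png to an EmptyParser still counts as supported, while excluding the type does not.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a parser registry (hypothetical; not Jackrabbit or Tika code). */
class ParserRegistryModel {
    // MIME type -> name of the parser registered for it
    private final Map<String, String> parsersByType = new HashMap<>();

    void register(String mimeType, String parserName) {
        parsersByType.put(mimeType, parserName);
    }

    void exclude(String mimeType) {
        parsersByType.remove(mimeType);
    }

    /** Assumption: a type is "supported" as soon as any parser is registered for it. */
    boolean isSupportedMediaType(String mimeType) {
        return parsersByType.containsKey(mimeType);
    }

    /** Models the 7.9 config: images are mapped to an EmptyParser. */
    static ParserRegistryModel withEmptyParser() {
        ParserRegistryModel model = new ParserRegistryModel();
        model.register("application/pdf", "PDFParser");
        model.register("image/png", "DefaultParser");
        model.register("image/png", "EmptyParser"); // overrides, but the type stays registered
        return model;
    }

    /** Models the proposed config: images are excluded altogether. */
    static ParserRegistryModel withExclusion() {
        ParserRegistryModel model = new ParserRegistryModel();
        model.register("application/pdf", "PDFParser");
        model.register("image/png", "DefaultParser");
        model.exclude("image/png"); // no parser left for this type
        return model;
    }
}
```

In this model, withEmptyParser() reports image/png as supported (so the binary would be fetched only to be parsed as empty), while withExclusion() reports it as unsupported and the fetch can be skipped.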

The solution is fairly simple AFAICS. The tika-config.xml should look as follows:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <!-- and all the other 'empty parser' configs above -->
      <!-- also add exclusion of ExecutableParser by default -->
      <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
    </parser>
  </parsers>
</properties>

I should still test it, but almost certainly, with the above config there won't be a parser for 'image/jpeg', and thus isSupportedMediaType will return false (instead of true with EmptyParser).

The downside is that it is not easy to override the tika-config.xml (only by specifying a file path, but I want to override it in a jar file). Also, the org.apache.jackrabbit.core.query.lucene.SearchIndex#createParser method is private, hence we cannot override it. We should rather not do that anyway, because SearchIndex uses the private 'parser' instance member.

It would be nice if in SearchIndex

 else {
     ClassLoader loader = SearchIndex.class.getClassLoader();
     url = loader.getResource(tikaConfigPath);
 }

would be

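 else {
     // Try the class loader of the concrete (possibly extending) class first
     ClassLoader loader = getClass().getClassLoader();
     url = loader.getResource(tikaConfigPath);
     if (url == null) {
         // Fall back to the class loader that loaded SearchIndex itself
         loader = SearchIndex.class.getClassLoader();
         url = loader.getResource(tikaConfigPath);
     }
 }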

This way, you can override the tika-config.xml from a class that extends SearchIndex like we do.
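The fallback lookup can be tried in isolation with a small sketch. ConfigLocator and locate are hypothetical names used for illustration only and are not part of Jackrabbit; the sketch merely assumes that the primary loader is asked first and the fallback loader only when the primary finds nothing.

```java
import java.net.URL;

/** Hypothetical helper illustrating a two-step class loader lookup. */
class ConfigLocator {
    /**
     * Looks up the resource with the primary loader first and
     * consults the fallback loader only if the primary finds nothing.
     * Returns null when neither loader can find the resource.
     */
    static URL locate(String path, ClassLoader primary, ClassLoader fallback) {
        URL url = (primary != null) ? primary.getResource(path) : null;
        if (url == null && fallback != null) {
            url = fallback.getResource(path);
        }
        return url;
    }
}
```

With the subclass's loader as primary and SearchIndex's loader as fallback, a tika-config.xml packaged in the extending project's jar would win, and the default one would still be picked up when no override exists.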

Alternatively, we could just sanitize the tika-config.xml in Jackrabbit. That might be enough.

Issue Links

clones

REPO-1451 Indexing performance degradation of images (and other binaries)

Closed

People

Assignee: Unassigned
Reporter: Ard Schrijvers
Votes: 0
Watchers: 1

Dates

Created: 30/Mar/16 10:19 AM
Updated: 30/Mar/16 11:00 AM
Resolved: 30/Mar/16 11:00 AM