Hippo CMS / CMS-9407

Minimize the transitive dependencies pulled in by tika-parser

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Normal
    • Resolution: Fixed
    • None
    • Fix Version: CMS-10.0-GA
    • None

    Description

      We use tika-parser (1.3), which pulls in a very large set of 'optional' transitive dependencies, many of which we don't need and shouldn't pull in by default.

      The tika-parser dependencies now excluded by default, with the content types they support, are:

      • PKCS7 signed messages (bouncycastle bcmail/bcprov)
      • audio and video formats (vorbis, mp4)
      • NetCDF and HDF
      • MIME4J (raw email and mbox files)
      • EXIF (full-text indexing of image metadata)
      • Rome (RSS and Atom feeds)
      • Commons Compress (archives such as ZIP files); this exclusion had to be reverted because of a regression, see CMS7-9412
      • asm (Java class files)
      • Boilerpipe (removal of surplus "clutter" around the main HTML content)
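
      As a rough illustration of what such a default exclusion looks like in a Maven POM, the sketch below excludes two of the groups listed above. The Maven coordinates are assumptions, not taken from this issue: the artifact is assumed to be org.apache.tika:tika-parsers, and the excluded groupId/artifactId pairs should be checked against the output of `mvn dependency:tree` before use.

      ```xml
      <!-- Sketch only: coordinates below are assumed, verify with mvn dependency:tree -->
      <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.3</version>
        <exclusions>
          <!-- PKCS7 signed messages (bouncycastle bcmail/bcprov) -->
          <exclusion>
            <groupId>org.bouncycastle</groupId>
            <artifactId>bcmail-jdk15</artifactId>
          </exclusion>
          <exclusion>
            <groupId>org.bouncycastle</groupId>
            <artifactId>bcprov-jdk15</artifactId>
          </exclusion>
          <!-- Rome (RSS and Atom feeds) -->
          <exclusion>
            <groupId>rome</groupId>
            <artifactId>rome</artifactId>
          </exclusion>
        </exclusions>
      </dependency>
      ```

      Maven matches an exclusion on groupId and artifactId only, so no version is given inside an <exclusion> element.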

      Anyone who needs such resources indexed by tika-parser can, and should, add the required dependencies explicitly.
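
      For example, a project that still needs RSS and Atom feeds indexed could declare the Rome dependency itself. This is a minimal sketch: the coordinates and the version are assumptions and should be matched to whatever the tika-parser 1.3 POM declares for that library.

      ```xml
      <!-- Sketch only: re-adds an excluded optional parser dependency in the project's own POM -->
      <dependency>
        <groupId>rome</groupId>
        <artifactId>rome</artifactId>
        <!-- assumed version; align it with the one tika-parser 1.3 was built against -->
        <version>0.9</version>
      </dependency>
      ```

      Once the library is on the classpath again, tika-parser should be able to handle the corresponding content type without further changes to the exclusion list.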

            People

              Assignee: Unassigned
              Reporter: Ate Douma (adouma)