Tuesday, March 28, 2017

Tika

"The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more. You can find the latest release on the download page. Please see the Getting Started page for more information on how to start using Tika.

The Parser and Detector pages describe the main interfaces of Tika and how they work."
https://tika.apache.org/

"A Python port of the Apache Tika library that makes Tika available using the Tika REST Server."

https://github.com/chrismattmann/tika-python
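
As a rough illustration of the "single interface" idea, here is a minimal sketch using tika-python. It assumes the package is installed (pip install tika) and a Java runtime is available, since the library talks to a Tika REST server that it starts locally on first use. The helper names extract and summarize are my own and not part of the library; the only library call used is parser.from_file, which returns a dictionary with "metadata" and "content" keys.

```python
from typing import Any, Dict, Optional, Tuple


def extract(path: str) -> Dict[str, Any]:
    """Parse a file with tika-python and return its result dictionary."""
    # Imported lazily so summarize() below stays usable without Tika installed.
    from tika import parser  # assumption: `pip install tika` and Java are present

    # from_file() hands the file to the Tika REST server and returns a dict
    # that includes "metadata" and "content".
    return parser.from_file(path)


def summarize(parsed: Dict[str, Any]) -> Tuple[Optional[str], str]:
    """Hypothetical helper: pull the content type and trimmed text from a result."""
    # Either key can be missing or None (for example, a file with no text).
    metadata = parsed.get("metadata") or {}
    content_type = metadata.get("Content-Type")
    # Some metadata values may come back as lists; keep the first entry.
    if isinstance(content_type, list):
        content_type = content_type[0] if content_type else None
    text = (parsed.get("content") or "").strip()
    return content_type, text
```

With those in place, something like summarize(extract("report.pdf")) should give the detected content type and the extracted text, and the same two calls should work for a PPT or XLS file ("report.pdf" is just a placeholder path).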
