Deck Chairs and Fiddles: PDFx

Sunday, November 7, 2021

PDFx

Extract references (PDF, URL, DOI, arXiv) and metadata from a PDF. Optionally download all referenced PDFs and check for broken links.
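To give a feel for what detecting those four reference types involves, here is a minimal sketch using only Python's standard library. It is not pdfx's code: the regular expressions and the `find_references` helper are my own illustrative assumptions, and it works on text that has already been extracted from a PDF.

```python
import re

# Illustrative patterns only (assumptions); pdfx's own regexes may differ.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)
DOI_RE = re.compile(r"\b(?:doi:\s*)?(10\.\d{4,9}/[^\s\"<>]+)", re.IGNORECASE)
ARXIV_RE = re.compile(r"\barxiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)


def find_references(text):
    """Return a dict mapping reference type to a sorted list of unique matches."""
    refs = {"pdf": set(), "url": set(), "doi": set(), "arxiv": set()}
    for match in URL_RE.finditer(text):
        # Drop sentence punctuation that the URL pattern swallowed.
        url = match.group(0).rstrip(".,;")
        # A URL whose path ends in .pdf counts as a PDF reference.
        kind = "pdf" if url.lower().split("?")[0].endswith(".pdf") else "url"
        refs[kind].add(url)
    for match in DOI_RE.finditer(text):
        refs["doi"].add(match.group(1).rstrip(".,;"))
    for match in ARXIV_RE.finditer(text):
        refs["arxiv"].add(match.group(1))
    return {kind: sorted(values) for kind, values in refs.items()}
```

Calling `find_references(page_text)` returns a dict with the keys `pdf`, `url`, `doi` and `arxiv`. A real tool also has to read link annotations embedded in the PDF itself, which this sketch ignores.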

Features

Extract references and metadata from a given PDF
Detect PDF, URL, arXiv and DOI references
Fast, parallel download of all referenced PDFs
Find broken hyperlinks (using the -c flag)
Output as text or JSON (using the -j flag)
Extract the PDF text (using the --text flag)
Use as command-line tool or Python package
Compatible with Python 2 and 3
Works with local and online PDFs
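
The parallel checking and JSON output in the list above can be approximated with the standard library alone. The sketch below is an assumption about how such a check might be structured, not how pdfx does it: `http_status`, `check_links` and `to_json` are hypothetical names, and treating any status of 400 or above (or no response at all) as broken is my own choice.

```python
import json
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request fails outright."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return None


def check_links(urls, fetch_status=http_status, workers=8):
    """Check urls concurrently; return {url: {"status": ..., "broken": bool}}."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so statuses line up with urls.
        statuses = list(pool.map(fetch_status, urls))
    return {
        url: {"status": status, "broken": status is None or status >= 400}
        for url, status in zip(urls, statuses)
    }


def to_json(results):
    """Serialize check results as pretty-printed JSON."""
    return json.dumps(results, indent=2, sort_keys=True)
```

With the default `fetch_status`, `to_json(check_links(urls))` sends one HEAD request per URL on a small thread pool. Some servers reject HEAD requests, so a more careful checker would fall back to GET before calling a link broken.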

https://github.com/metachris/pdfx
