Sunday, November 7, 2021

PDFx

 

Extract references (PDF, URL, DOI, arXiv) and metadata from a PDF. Optionally download all referenced PDFs and check for broken links.

Features

  • Extract references and metadata from a given PDF
  • Detect PDF, URL, arXiv and DOI references
  • Fast, parallel download of all referenced PDFs
  • Find broken hyperlinks (using the -c flag)
  • Output as text or JSON (using the -j flag)
  • Extract the PDF text (using the --text flag)
  • Use as command-line tool or Python package
  • Compatible with Python 2 and 3
  • Works with local and online PDFs
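
To make the reference-detection feature concrete, here is a minimal sketch of how URL, PDF, DOI and arXiv references might be picked out of text that has already been extracted from a PDF. It uses only the Python standard library and is not PDFx's own code: the regular expressions, the find_references function and the sample string are all assumptions made for illustration, and the patterns PDFx actually uses may differ.

```python
import re

# Illustrative patterns only (assumptions); PDFx's own expressions may differ.
URL_RE = re.compile(r"https?://[^\s<>()]+", re.IGNORECASE)
DOI_RE = re.compile(r"\bdoi:\s*(10\.\d{4,9}/[^\s<>]+)", re.IGNORECASE)
ARXIV_RE = re.compile(r"\barxiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)


def find_references(text):
    """Return a dict mapping reference type to a sorted list of unique matches."""
    refs = {"pdf": set(), "url": set(), "doi": set(), "arxiv": set()}
    for match in URL_RE.finditer(text):
        # Drop sentence punctuation that got glued onto the end of the link.
        url = match.group(0).rstrip(".,;")
        # A link ending in .pdf is counted as a PDF reference, anything else as a URL.
        kind = "pdf" if url.lower().endswith(".pdf") else "url"
        refs[kind].add(url)
    for match in DOI_RE.finditer(text):
        refs["doi"].add(match.group(1).rstrip(".,;"))
    for match in ARXIV_RE.finditer(text):
        refs["arxiv"].add(match.group(1))
    return {kind: sorted(found) for kind, found in refs.items()}


# Hypothetical sample text standing in for the text extracted from a PDF.
sample = (
    "See https://example.com/paper.pdf and https://example.org/page. "
    "Also doi:10.1000/xyz123 and arXiv:1601.00001v2."
)
result = find_references(sample)
print(result)
```

A real extractor also has to read link annotations embedded in the PDF and cope with references broken across lines, which this sketch ignores.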

https://github.com/metachris/pdfx 
