我有5个PDF文件,每个文件都有指向另一个PDF文件中不同页面的链接。这些文件是大型PDF的每个目录(每个约1000页),使手动提取成为可能,但非常痛苦。到目前为止,我已经尝试在Acrobat Pro中打开该文件,我可以右键单击每个链接并查看它指向的页面,但我需要以某种方式提取所有链接。我不反对不得不对链接进行大量的进一步解析,但我似乎无法以任何方式将它们拉出来。我试图将Acrobat Pro中的PDF作为HTML或Word导出,但这两种方法都没有维护链接。
我的智慧结束了,任何帮助都会很棒。我很乐意使用Python或其他一系列语言
答案 0 :(得分:5)
使用pyPdf,
寻找URIimport pyPdf
f = open('TMR-Issue6.pdf','rb')
pdf = pyPdf.PdfFileReader(f)
pgs = pdf.getNumPages()
key = '/Annots'
uri = '/URI'
ank = '/A'
for pg in range(pgs):
p = pdf.getPage(pg)
o = p.getObject()
if o.has_key(key):
ann = o[key]
for a in ann:
u = a.getObject()
if u[ank].has_key(uri):
print u[ank][uri]
给出,
http://www.augustsson.net/Darcs/Djinn/
http://plato.stanford.edu/entries/logic-intuitionistic/
http://citeseer.ist.psu.edu/ishihara98note.html
etc...
我找不到一个链接到另一个pdf的文件,但我怀疑URI字段应该包含file:///myfiles
答案 1 :(得分:1)
我刚刚为此制作了一个小型Python工具,列出/下载来自给定PDF的所有引用的PDF:https://www.metachris.com/pdfx/(另请:https://github.com/metachris/pdfx)
$ ./pdfx.py https://weakdh.org/imperfect-forward-secrecy.pdf -d ./
Reading url 'https://weakdh.org/imperfect-forward-secrecy.pdf'...
Saved pdf as './imperfect-forward-secrecy.pdf'
Document infos:
- CreationDate = D:20150821110623-04'00'
- Creator = LaTeX with hyperref package
- ModDate = D:20150821110805-04'00'
- PTEX.Fullbanner = This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1
- Producer = pdfTeX-1.40.14
- Title = Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice
- Trapped = False
- Pages = 13
Analyzing text...
- URLs: 49
- URLs to PDFs: 17
JSON summary saved as './imperfect-forward-secrecy.pdf.infos.json'
Downloading 17 referenced pdfs...
Created directory './imperfect-forward-secrecy.pdf-referenced-pdfs'
Downloaded 'http://cr.yp.to/factorization/smoothparts-20040510.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/smoothparts-20040510.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35517.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35517.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35514.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35514.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35519.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35519.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35522.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35522.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35509.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35509.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35528.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35528.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35513.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35513.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35533.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35533.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35551.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35551.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35527.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35527.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35520.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35520.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35526.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35526.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35515.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35515.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35529.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35529.pdf'...
Downloaded 'http://cryptome.org/2013/08/spy-budget-fy13.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/spy-budget-fy13.pdf'...
Downloaded 'http://www.spiegel.de/media/media-35671.pdf' to './imperfect-forward-secrecy.pdf-referenced-pdfs/media-35671.pdf'...
该工具使用PyPDF2,事实上的Python标准库来读取PDF内容,regular expression to match all urls,并为每个PDF启动下载线程,如果您使用-d
运行它选项(适用于--download-pdfs
)。