Question

I am using Python to parse an online PDF for later use. My code is as follows:

from tika import parser
import requests
import io
url = 'https://www.whitehouse.gov/wp-content/uploads/2017/12/NSS-Final-12-18-2017-0905.pdf'
response = requests.get(url)
with io.BytesIO(response.content) as open_pdf_file:
    pdfFile = parser.from_file(open_pdf_file)
print(pdfFile)

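But it raises:

AttributeError: '_io.BytesIO' object has no attribute 'decode'

I followed the question "How can i read a PDF file from inline raw_bytes (not from file)?" as an example.

That example uses PyPDF2, but I need to use Tika, because Tika gives better results than PyPDF2.

Thanks for your help.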

Answer 1

To use tika, you need to have Java 8 installed. The code you need to retrieve and print the PDF content is as follows:

from tika import parser

url = 'https://www.whitehouse.gov/wp-content/uploads/2017/12/NSS-Final-12-18-2017-0905.pdf'

pdfFile = parser.from_file(url)

print(pdfFile["content"])
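
If you would rather keep downloading the file yourself with requests, as in the question, tika's parser.from_buffer takes the raw bytes directly, so no BytesIO wrapper is needed. Below is a minimal sketch, assuming network access, a working Java install, and the tika and requests packages. The helper names fetch_pdf_bytes, looks_like_pdf and extract_text are made up for illustration and are not part of either library.

```python
# Third-party imports live inside the functions so that looks_like_pdf()
# can be used on its own, without requests, tika, or Java available.


def fetch_pdf_bytes(url, timeout=30):
    """Download the document and return its raw bytes (hypothetical helper)."""
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail early on 4xx/5xx responses
    return response.content


def looks_like_pdf(data):
    """Return True if the bytes start with the usual PDF header."""
    return data.startswith(b"%PDF-")


def extract_text(pdf_bytes):
    """Send the bytes to Tika and return the extracted text (hypothetical helper)."""
    from tika import parser  # using the parser starts a local Tika server (needs Java)

    parsed = parser.from_buffer(pdf_bytes)
    # "content" can be None when Tika finds no text, so fall back to "".
    return parsed.get("content") or ""


# Example usage (needs network access, Java, and the tika package):
# url = 'https://www.whitehouse.gov/wp-content/uploads/2017/12/NSS-Final-12-18-2017-0905.pdf'
# data = fetch_pdf_bytes(url)
# if looks_like_pdf(data):
#     print(extract_text(data))
```

The header check is only a cheap guard against the server returning an HTML error page instead of the PDF. It is not a full validation of the file.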
