How to parse a PDF using the tika library

Date: 2018-02-09 07:16:09

Tags: python-3.x pdf apache-tika text-extraction

I am trying to parse a PDF file with the tika library, but I am getting this complicated error:

Traceback (most recent call last):
  File "/home/olivia/.local/lib/python3.6/site-packages/urllib3/connection.py", line 141, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "/home/olivia/.local/lib/python3.6/site-packages/urllib3/util/connection.py", line 83, in create_connection
    raise err
  File "/home/olivia/.local/lib/python3.6/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

I get the same error when using its wrapper.

Code:
import tika
from tika import parser
parsed = parser.from_file('simple1.pdf')
print(parsed["content"])

For the detailed error, see (link not preserved)

3 Answers:

Answer 0 (score: 0)

Specify the full path to the PDF file and use forward slashes, for example:

from tika import parser

parsedPDF = parser.from_file('C:/Users/xyzuser/Documents/abc.pdf')
print(parsedPDF)
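
As a small sketch of the advice above, the standard library's pathlib can turn a relative or backslash-separated file name into an absolute path written with forward slashes before it is passed to parser.from_file. The helper name to_forward_slash_path is made up for illustration and is not part of tika.

```python
from pathlib import Path


def to_forward_slash_path(name):
    """Return an absolute path for `name`, written with forward slashes."""
    # resolve() makes the path absolute relative to the current directory;
    # as_posix() renders it with '/' separators on every platform.
    return Path(name).resolve().as_posix()


if __name__ == "__main__":
    # 'simple1.pdf' is the file name used in the question.
    print(to_forward_slash_path("simple1.pdf"))
```

The resulting string can then be used in place of the hand-typed path in the call above.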

Answer 1 (score: 0)

  1. Download the tika jars (tika-app.jar, tika-server.jar and tika-server.jar.md5) from https://tika.apache.org/download.html

  2. Keep these jars (renamed to tika-app.jar, tika-server.jar and tika-server.jar.md5) in the /tmp folder on Linux, or in C:\Users\<user>\AppData\Local\Temp\ on Windows.

    from tika import parser

    parsedPDF = parser.from_file("/path/to/file/my_pdf.pdf")

    print(parsedPDF["metadata"])

    print(parsedPDF["content"].encode('ascii', errors='ignore'))
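
Before running the snippet above, it may help to confirm that the files from step 2 are actually where they are expected. The following is a rough sketch using only the standard library. It assumes that the folder named in step 2 is the one returned by tempfile.gettempdir(), and that the .md5 file holds the hex digest as its first token. The function names missing_jars and md5_matches are hypothetical helpers, not part of tika.

```python
import hashlib
import tempfile
from pathlib import Path

# File names as listed in step 2 of this answer.
EXPECTED = ("tika-app.jar", "tika-server.jar", "tika-server.jar.md5")


def missing_jars(folder=None):
    """List which of the expected files are absent from `folder`."""
    # Assumption: the system temp directory is the folder meant in step 2.
    folder = Path(folder or tempfile.gettempdir())
    return [name for name in EXPECTED if not (folder / name).is_file()]


def md5_matches(jar_path, md5_path):
    """Compare the jar's MD5 digest with the first token of the .md5 file."""
    digest = hashlib.md5(Path(jar_path).read_bytes()).hexdigest()
    tokens = Path(md5_path).read_text().split()
    # An empty .md5 file can never match.
    return bool(tokens) and digest == tokens[0].lower()


if __name__ == "__main__":
    print("Missing files:", missing_jars())
```

If missing_jars() returns an empty list, md5_matches can be called on tika-server.jar and tika-server.jar.md5 to check that the download was not truncated.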

Answer 2 (score: -1)

You only need to make a small change to your code, as follows:

parsed = parser.from_file('simple1.pdf','http://localhost:9998/tika')

Worked for me, hope it works for you too :)
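
Since the traceback in the question ends in "Connection refused", one quick thing to try before the call above is to check whether anything is accepting connections on that host and port. This is only a sketch with the standard library. The host and port are taken from the URL in this answer, and server_is_listening is a made-up helper name. It says nothing about whether the listener is really a Tika server.

```python
import socket


def server_is_listening(host="localhost", port=9998, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises an OSError subclass (for example
        # ConnectionRefusedError) when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if server_is_listening():
        print("Something is listening on localhost:9998")
    else:
        print("Nothing is listening on localhost:9998")
```

If this prints the second message, passing the endpoint to parser.from_file will most likely hit the same Errno 111 until a server is started on that port.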