Tabula-py font error (Tahoma) when parsing a PDF

Time: 2016-10-13 10:23:12

Tags: python pdf debian tabula

I'm running on Debian Jessie. I'm trying to parse my PDF with the tabula-py library, but I get this error:

   2016 12:16:57 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont 

getawtFont  
0                                             Italic                          
1   2016 12:16:57 PM org.apache.fontbox.util.Font...                          
2                                             Italic                          
                                       Oct 13  \
0  INFO: Can't find the specified font Tahoma   
1                                      Oct 13   
2             WARNING: Font not found: Tahoma   

How can I fix this?

Here is my code:

import cv2
import numpy as np
# from matplotlib import pyplot as plt
from wand.image import Image
from tabula import read_pdf_table

# Converting first page into PNG
with Image(filename="ed.pdf", resolution=200) as pdf:
    pdf.compression_quality = 99
    pdf.save(filename="temp.png")

img = cv2.imread('temp.png', 0)
img2 = img.copy()
template = cv2.imread('test cust.png', 0)
imgw, imgh = img.shape[::-1]
w, h = template.shape[::-1]

methods = ['cv2.TM_CCOEFF', 'cv2.TM_CCOEFF_NORMED', 'cv2.TM_CCORR', 'cv2.TM_CCORR_NORMED', 'cv2.TM_SQDIFF', 'cv2.TM_SQDIFF_NORMED']

for meth in methods:
    img = img2.copy()
    method = eval(meth)

    # Apply template Matching
    res = cv2.matchTemplate(img, template, method)
    min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(res)

    # If the method is TM_SQDIFF or TM_SQDIFF_NORMED, take minimum
    if method in [cv2.TM_SQDIFF, cv2.TM_SQDIFF_NORMED]:
        top_left = min_loc
    else:
        top_left = max_loc

    bottom_right = (top_left[0] + w, top_left[1] + h)

    top = top_left[1]
    left = top_left[0]
    bottom = imgh - bottom_right[1]
    right = imgw - bottom_right[0]

    cv2.rectangle(img, top_left, bottom_right, [0,255,0], 10)

    df = read_pdf_table('ed.pdf', area=(top,left,bottom,right))
    print(df)

The error occurs on this line:

df = read_pdf_table('ed.pdf', area=(top,left,bottom,right))

2 answers:

Answer 0 (score: 1)

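I'm the author of tabula-py. I think you are trying to extract data from an image-based PDF, but tabula-py is not an OCR tool. It assumes that the text to extract is embedded in the PDF.

I think you should try an OCR tool such as the Google Cloud Vision API.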
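For illustration only, here is a rough sketch of how a page image could be packaged for the Cloud Vision REST endpoint (`images:annotate`). The helper name `build_vision_request` is hypothetical and not part of any library. The sketch only builds the JSON body; sending it requires valid credentials and an HTTP client, which are left out here.

```python
import base64
import json


def build_vision_request(image_bytes, feature_type="TEXT_DETECTION"):
    """Build the JSON body for a Cloud Vision images:annotate request.

    Hypothetical helper: the image is sent inline as base64 text.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "requests": [
            {
                "image": {"content": encoded},
                "features": [{"type": feature_type}],
            }
        ]
    }


# Example with placeholder bytes; in practice these would be the bytes of a
# rendered page image (for instance the PNG produced from the PDF).
body = json.dumps(build_vision_request(b"placeholder image bytes"))

# The body would then be POSTed to
# https://vision.googleapis.com/v1/images:annotate
# together with whatever authentication your project uses.
```

The response, if the request succeeds, contains the recognized text, which you would then have to split into table rows and columns yourself, since OCR output carries no table structure.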

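Answer 1 (score: 0)

Just adding to what Chezou said: Google Cloud Vision OCR does not support PDF directly. You first need to extract the pages (as images) with a tool such as Ghostscript, and then send each page image to the API. However, if your PDF has three pages or fewer, you can use the free OCR.space PDF OCR API, which accepts the whole PDF document as input.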
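As a minimal sketch of the page-extraction step, assuming Ghostscript is installed and available as `gs` on the PATH: the function names, the output pattern `page-%03d.png`, and the 200 dpi default are illustrative choices, not anything prescribed by the API.

```python
import subprocess


def ghostscript_command(pdf_path, out_pattern="page-%03d.png", dpi=200):
    """Return the Ghostscript argument list that renders each page to a PNG.

    %03d in the output pattern is expanded by Ghostscript to the page number.
    """
    return [
        "gs",
        "-dNOPAUSE",          # do not pause between pages
        "-dBATCH",            # exit when processing is finished
        "-sDEVICE=png16m",    # 24-bit colour PNG output
        "-r{}".format(dpi),   # rendering resolution in dpi
        "-sOutputFile={}".format(out_pattern),
        pdf_path,
    ]


def render_pages(pdf_path, out_pattern="page-%03d.png", dpi=200):
    """Run Ghostscript; raises CalledProcessError if gs reports a failure."""
    subprocess.check_call(ghostscript_command(pdf_path, out_pattern, dpi))
```

Calling `render_pages("ed.pdf")` would write one PNG per page to the current directory, and each of those files can then be sent to the OCR service separately.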