Unable to read a PDF with camelot

Time: 2019-02-13 06:40:04

Tags: python pdf python-camelot

I read a PDF file with camelot, but I only get part of its contents.

How can I read all of the pages?

import camelot
import pandas as pd

tables = camelot.read_pdf('data.pdf', pages='all', flavor='stream')
df = tables[0].df

The resulting df:

                                              0            1  \
0                                                               
1   Land Parcel                                   City          
2                                                               
3                                                               
4   Land Parcel No. CTP-1813                      Cangzhou 滄州   
5   .\n.\n.\n.\n.\n.\n.\n.\n.\n.\nCTP-1813 號地塊 .                
6   Land Parcel No. 2018GC22026                   Beihai 北海     
7   .\n.\n.\n.\n.\n.\n.\n2018GC22026 號地塊.                       
8                                                               
9                                                               
10                                                              
11                                                              
12  Land parcels A, B, C and D for                Guigang 貴港    
13  the commercial and residential                              
14  project\nin Station Plaza at                                

                      2          3          4  
0                                   Land       
1   Land Use             Site Area  Premium    
2                                   (RMB       
3                        (sq.m.)    thousand)  
4   Commercial and       97,407.3   759,400    
5   residential                                
6   Wholesale,\nretail,  159,878.4  1,067,260  
7   residential,                               
8   catering,                                  
9   commercial and                             
10  financial and                              
11  residential                                
12  Commercial and       139,600.2  631,870    
13  residential                                
14                               

I also tried tabula, which returned more of the results, but still not all of them.

2 answers:

Answer 0 (score: 0)

Not sure why camelot does not work. Try pdfminer. It works well on your sample:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

pdf_rm = PDFResourceManager()
with StringIO() as s:
    with TextConverter(pdf_rm, s, laparams=LAParams()) as d:
        with open('data.pdf', 'rb') as f:
            interpreter = PDFPageInterpreter(pdf_rm, d)
            # Run every page through the text converter
            for page in PDFPage.get_pages(f):
                interpreter.process_page(page)
            # Grab the text before the StringIO buffer is closed
            text = s.getvalue()

print(text)

Output:

Land Parcel

City

Land Use

Site Area

Land Parcel No. CTP-1813

CTP-1813 號地塊 . . . . . . . . . . .

Land Parcel No. 2018GC22026

2018GC22026 號地塊. . . . . . . .

Land parcels A, B, C and D for the commercial and residential project in Station Plaza at Guigang City 貴港市高鐵站前廣場商住項目 A、B、C及D地塊 . . . . . . . . . . .

Land Parcel No. 201821 and

No. 201822 為201821號及201822 號地塊. . Land Parcel No. QZ(18)049 and

No. QZ(18)050 QZ(18)049號和QZ(18)050號地 塊 . . . . . . . . . . . . . . . . . . . . . . . .

Land Parcel

No. 630102102006GB00321 630102102006GB00321 號地塊 . . . . . . . . . . . . . . . . . . . .

Land Parcel No. Xing Zheng

Chu (2018)45-1 滎政儲(2018)45-1號地塊 . . . . .

Land Parcel

No. XH2018GC012-1, No. XH2018GC012-2 and No. XH2018GC012-3 XH2018GC012-1號、 XH2018GC012-2號和 XH2018GC012-3號地塊. . . . . .

Land Parcel No. 2018-52

2018-52號地塊 . . . . . . . . . . . . .

Land Parcel B No. Yan

J[2018]Z003 of the Xikou Old Residence Renovation 煙J[2018]Z003號西口舊居改造 B地塊. . . . . . . . . . . . . . . . . . . . .

of Guihuang Road in Chengxin District 靈川縣城新區桂黃公路東側地 塊 . . . . . . . . . . . . . . . . . . . . . . . .

Land Parcel No. BS18-1J-307

BS18-1J-307號地塊 . . . . . . . . .

Land Parcel No. Jing Tu Zheng

Chu Gua (Shun) [2018]043 京土整儲掛(順)[2018]043號地 塊 . . . . . . . . . . . . . . . . . . . . . . . .

Land

Premium

(RMB

thousand) 759,400

Cangzhou 滄州 Commercial and

(sq.m.) 97,407.3

Beihai 北海

residential

Wholesale, retail,

159,878.4

1,067,260

residential, catering, commercial and financial and residential

Guigang 貴港 Commercial and

residential

139,600.2

631,870

Yancheng 鹽城 Commercial and

167,738.0

339,400

residential

Guiyang 貴陽 Commercial and

117,023.0

342,050

residential

Xining 西寧

Commercial and

77,075.5

404,635

residential

Xingyang 滎陽 Commercial

72,351.7

260,400

Taizhou 泰州

Commercial and

217,681.3

728,520

residential

Xuzhou 徐州

Residential

74,448.6

1,203,000

Yantai 煙臺

Residential,

107,015.1

205,776

commercial service, public management and public service

Commercial and

63,442.7

62,820

residential

Chongqing 重慶 Residential

136,246.3

238,700

Beijing 北京

Class-2

69,856.0

2,330,000

residential, institutional pension facilities and basic educational

– 4 –

Land Parcel located to the east

Guilin 桂林
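
The pdfminer output above is plain text, not a table, and it still carries the dot leaders (". . . . .") that the PDF uses to pad the parcel names. If the text is going to be processed further, a small cleanup pass may help. The following is only a sketch: `strip_dot_leaders` is a hypothetical helper name, and treating any run of two or more dots as a leader is an assumption that happens to fit this sample.

```python
import re

# Two or more dots in a row, optionally separated by whitespace.
# Single dots ("No.", "97,407.3") are left alone.
DOT_LEADER = re.compile(r"(?:\s*\.){2,}")


def strip_dot_leaders(text):
    """Remove dot leaders and blank lines from extracted text."""
    cleaned = []
    for line in text.splitlines():
        line = DOT_LEADER.sub(" ", line)
        # Collapse runs of whitespace and trim both ends
        line = " ".join(line.split())
        if line:
            cleaned.append(line)
    return "\n".join(cleaned)
```

Calling `strip_dot_leaders(text)` on the extracted string keeps the parcel names, cities, and figures, but it does not rebuild the table's rows and columns.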

Answer 1 (score: 0)

You can try the following code, which uses the table_areas parameter to specify the table boundaries:

import camelot
tables = camelot.read_pdf("data.pdf", pages='1', flavor='stream', table_areas=['0,800,800,0'])
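
As far as I understand the camelot documentation, each table_areas entry is a string of the form "x1,y1,x2,y2", where (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner in PDF coordinates, with the origin at the bottom-left of the page. A small formatter can catch swapped corners early. This is a sketch; `table_area` is a hypothetical helper, not part of camelot.

```python
def table_area(x1, y1, x2, y2):
    """Format a region as an "x1,y1,x2,y2" string for table_areas.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner.
    The PDF origin is at the bottom-left, so y1 must be larger than y2.
    """
    if x1 >= x2 or y1 <= y2:
        raise ValueError("expected top-left (x1, y1) and bottom-right (x2, y2)")
    return f"{x1},{y1},{x2},{y2}"
```

For example, `table_area(0, 800, 800, 0)` gives the same '0,800,800,0' string used above. Note that the call above passes pages='1', so it reads only the first page; since the question asks for every page, pages='all' is probably what is wanted there.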

For more information, see https://camelot-py.readthedocs.io/en/master/user/advanced.html#specify-table-areas
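
One more thing that may be worth checking: the code in the question only looks at `tables[0]`, which is the first detected table. With pages='all', camelot returns one entry per table it finds, so the remaining rows may be sitting in the other entries. A minimal sketch for gathering them all follows. `collect_rows` and `read_all_rows` are hypothetical names, and it assumes each camelot table exposes its cells as a list of lists through `.data`.

```python
def collect_rows(tables):
    """Return the rows of every detected table, in order."""
    rows = []
    for table in tables:
        # Assumption: table.data is a list of rows, each a list of cell strings
        rows.extend(table.data)
    return rows


def read_all_rows(path):
    """Read every page of the PDF and gather the rows of all tables."""
    # Imported here so collect_rows can be used without camelot installed
    import camelot

    tables = camelot.read_pdf(path, pages="all", flavor="stream")
    return collect_rows(tables)
```

If a DataFrame is preferred, `pd.concat([t.df for t in tables], ignore_index=True)` should do much the same thing with the `.df` attribute already used in the question.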