I read a PDF file with camelot, but I only get part of its contents.
How can I read all the pages?
import camelot
import pandas as pd
tables = camelot.read_pdf('data.pdf', pages='all', flavor='stream')
df = tables[0].df
The resulting df is:
0 1 \
0
1 Land Parcel City
2
3
4 Land Parcel No. CTP-1813 Cangzhou 滄州
5 .\n.\n.\n.\n.\n.\n.\n.\n.\n.\nCTP-1813 號地塊 .
6 Land Parcel No. 2018GC22026 Beihai 北海
7 .\n.\n.\n.\n.\n.\n.\n2018GC22026 號地塊.
8
9
10
11
12 Land parcels A, B, C and D for Guigang 貴港
13 the commercial and residential
14 project\nin Station Plaza at
2 3 4
0 Land
1 Land Use Site Area Premium
2 (RMB
3 (sq.m.) thousand)
4 Commercial and 97,407.3 759,400
5 residential
6 Wholesale,\nretail, 159,878.4 1,067,260
7 residential,
8 catering,
9 commercial and
10 financial and
11 residential
12 Commercial and 139,600.2 631,870
13 residential
14
I also tried tabula, which returned more of the results, but still not all of them.
Answer 0 (score: 0)
Not sure why camelot does not work. Try pdfminer; it works well on your sample:
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

pdf_rm = PDFResourceManager()
with StringIO() as s:
    with TextConverter(pdf_rm, s, laparams=LAParams()) as d:
        with open('data.pdf', 'rb') as f:
            interpreter = PDFPageInterpreter(pdf_rm, d)
            # Feed every page of the PDF through the text converter.
            for page in PDFPage.get_pages(f):
                interpreter.process_page(page)
    # Read the buffer before the with block closes it.
    text = s.getvalue()
print(text)
Output:
Land Parcel
City
Land Use
Site Area
Land Parcel No. CTP-1813
CTP-1813 號地塊 . . . . . . . . . . .
Land Parcel No. 2018GC22026
2018GC22026 號地塊. . . . . . . .
Land parcels A, B, C and D for the commercial and residential project in Station Plaza at Guigang City 貴港市高鐵站前廣場商住項目 A、B、C及D地塊 . . . . . . . . . . .
Land Parcel No. 201821 and
No. 201822 為201821號及201822 號地塊. . Land Parcel No. QZ(18)049 and
No. QZ(18)050 QZ(18)049號和QZ(18)050號地 塊 . . . . . . . . . . . . . . . . . . . . . . . .
Land Parcel
No. 630102102006GB00321 630102102006GB00321 號地塊 . . . . . . . . . . . . . . . . . . . .
Land Parcel No. Xing Zheng
Chu (2018)45-1 滎政儲(2018)45-1號地塊 . . . . .
Land Parcel
No. XH2018GC012-1, No. XH2018GC012-2 and No. XH2018GC012-3 XH2018GC012-1號、 XH2018GC012-2號和 XH2018GC012-3號地塊. . . . . .
Land Parcel No. 2018-52
2018-52號地塊 . . . . . . . . . . . . .
Land Parcel B No. Yan
J[2018]Z003 of the Xikou Old Residence Renovation 煙J[2018]Z003號西口舊居改造 B地塊. . . . . . . . . . . . . . . . . . . . .
of Guihuang Road in Chengxin District 靈川縣城新區桂黃公路東側地 塊 . . . . . . . . . . . . . . . . . . . . . . . .
Land Parcel No. BS18-1J-307
BS18-1J-307號地塊 . . . . . . . . .
Land Parcel No. Jing Tu Zheng
Chu Gua (Shun) [2018]043 京土整儲掛(順)[2018]043號地 塊 . . . . . . . . . . . . . . . . . . . . . . . .
Land
Premium
(RMB
thousand) 759,400
Cangzhou 滄州 Commercial and
(sq.m.) 97,407.3
Beihai 北海
residential
Wholesale, retail,
159,878.4
1,067,260
residential, catering, commercial and financial and residential
Guigang 貴港 Commercial and
residential
139,600.2
631,870
Yancheng 鹽城 Commercial and
167,738.0
339,400
residential
Guiyang 貴陽 Commercial and
117,023.0
342,050
residential
Xining 西寧
Commercial and
77,075.5
404,635
residential
Xingyang 滎陽 Commercial
72,351.7
260,400
Taizhou 泰州
Commercial and
217,681.3
728,520
residential
Xuzhou 徐州
Residential
74,448.6
1,203,000
Yantai 煙臺
Residential,
107,015.1
205,776
commercial service, public management and public service
Commercial and
63,442.7
62,820
residential
Chongqing 重慶 Residential
136,246.3
238,700
Beijing 北京
Class-2
69,856.0
2,330,000
residential, institutional pension facilities and basic educational
– 4 –
Land Parcel located to the east
Guilin 桂林
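The extracted text above still carries the dot leaders from the PDF layout (". . . . ."). As a possible clean-up step, here is a minimal sketch that strips them from a line using only the standard library. The name `strip_dot_leaders` and the rule "two or more dots separated by optional whitespace" are assumptions made for illustration, not part of pdfminer.

```python
import re

# Two or more dots, each optionally preceded by whitespace,
# e.g. " . . . . ." or ". ." (an assumed definition of a dot leader).
DOT_LEADER = re.compile(r"(?:\s*\.){2,}")


def strip_dot_leaders(line: str) -> str:
    """Remove dot-leader runs and collapse the leftover whitespace."""
    without_dots = DOT_LEADER.sub(" ", line)
    return " ".join(without_dots.split())


sample = "CTP-1813 號地塊 . . . . . . . . . . ."
print(strip_dot_leaders(sample))  # CTP-1813 號地塊
```

A single dot, as in "No." or "97,407.3", is left alone because the pattern needs at least two dots. A genuine ellipsis ("...") would also be removed, so adjust the pattern if the text contains any.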
Answer 1 (score: 0)
You can try the following code, which uses the table_areas parameter to specify the table boundaries:
tables = camelot.read_pdf("data.pdf", pages='1', flavor='stream', table_areas=['0,800,800,0'])
For more information, see https://camelot-py.readthedocs.io/en/master/user/advanced.html#specify-table-areas
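Each table_areas entry is a string of the form "x1,y1,x2,y2", where (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner of the region, in PDF coordinates with the origin at the bottom-left of the page. The sketch below builds such a string and rejects swapped corners. The helper `table_area` is a hypothetical name introduced here, not part of camelot, and the 800 x 800 region is simply the value used in the answer, not a measured page size.

```python
def table_area(x1, y1, x2, y2):
    """Format a region as an "x1,y1,x2,y2" string for table_areas.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner.
    PDF coordinates start at the bottom-left of the page, so y1 > y2.
    """
    if x1 >= x2 or y1 <= y2:
        raise ValueError("expected x1 < x2 and y1 > y2 (top-left, bottom-right)")
    return ",".join(str(v) for v in (x1, y1, x2, y2))


area = table_area(0, 800, 800, 0)
print(area)  # 0,800,800,0

# Assumed usage with camelot (needs camelot installed and data.pdf present):
# tables = camelot.read_pdf("data.pdf", pages="all", flavor="stream",
#                           table_areas=[area])
```

Because the y axis points upward, the first y value must be the larger one; passing the corners in screen order (top = 0) is an easy mistake that the check above catches.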