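Problem statement:
I have a PDF whose content is laid out like a table, but with no visible ruling lines. See the example below:
The image above shows how my table appears on one of the PDF pages.
My research
How to extract table as text from the PDF using Python? - I went through this question and all of its answers. None of them helped.
Tabula: I tried the tabula API, but it extracts only the headers and not the body text, probably because the table has no ruling lines.
I could convert the whole PDF to text and then extract the data with regular expressions or something similar. But that would be tedious and time-consuming, and whenever the PDF layout changes the code would have to be rewritten.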
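A minimal sketch of that text-based fallback, for illustration only. It assumes the PDF has already been converted to plain text and that the converter leaves at least two spaces between columns, which will not hold for every PDF. The sample text and the names `COLUMN_GAP` and `split_columns` are made up for this example.

```python
import re

# Assumption: columns are separated by runs of two or more whitespace
# characters, while words inside a cell are separated by a single space.
COLUMN_GAP = re.compile(r"\s{2,}")


def split_columns(line):
    """Split one line of extracted text into its column cells."""
    return COLUMN_GAP.split(line.strip())


# Made-up sample standing in for text extracted from a PDF page.
sample_text = """\
Name        Description               Link
alpha       first example entry       http://example.com/a
beta        second example entry      http://example.com/b
"""

rows = [split_columns(line) for line in sample_text.splitlines() if line.strip()]
for row in rows:
    print(row)
```

This breaks as soon as a cell is empty or the column gap shrinks to one space, which is exactly the fragility described above.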
Question
Is there any API or Python package that can help me do this (Windows and Python 3.x)?
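Answer 0 (score: 1)
Answer 1 (score: 0)
Try Camelot, and tell it that your table has no ruling lines by using the stream flavor:
import camelot
tables = camelot.read_pdf('file.pdf', flavor='stream')
For more information, see the documentation: https://camelot-py.readthedocs.io/en/master/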
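A slightly fuller sketch of the same idea, showing how the result might be inspected. The helper name `read_borderless_tables` and the `pages` default are choices made for this example, not part of Camelot. The import sits inside the function only so that the snippet loads even where camelot-py is not installed.

```python
def read_borderless_tables(pdf_path, pages="1"):
    """Return one pandas DataFrame per table found with the stream flavor."""
    import camelot  # third-party package: pip install camelot-py

    # flavor="stream" groups text by whitespace instead of looking for
    # drawn ruling lines, which suits tables without visible borders.
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor="stream")

    frames = []
    for i in range(len(tables)):
        table = tables[i]
        # parsing_report is a dict with accuracy and whitespace scores
        print(table.parsing_report)
        frames.append(table.df)
    return frames


# Example call (needs a real PDF on disk):
# frames = read_borderless_tables("file.pdf", pages="all")
```

If the detected columns are off, the stream flavor also accepts hints such as `table_areas` and `columns`; see the documentation for their exact format.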
Answer 2 (score: 0)
I did this with tabula-py:
>>> import tabula
>>> area = [70, 30, 750, 570]  # top, left, bottom, right, in points
>>> page2 = tabula.read_pdf("nar_2021_editorial-2.pdf", guess=False, lattice=False,
...                         stream=True, multiple_tables=False, area=area, pages="all",
...                         )
>>> page2
and got this result:
> 'pages' argument isn't specified.Will extract only from page 1 by default.
> [       ShortTitle                                              Text  \
> 0       Arena3Dweb         3D visualisation of multilayered networks
> 1          Aviator       Monitoring the availability of web services
> 2         b2bTools  Predictions for protein biophysical features and
> 3              NaN                                their conservation
> 4          BENZ WS          Four-level Enzyme Commission (EC) number
> ..             ...                                               ...
> 68  miRTargetLink2              miRNA target gene and target pathway
> 69             NaN                                          networks
> 70       mmCSM-PPI            Effects of multiple point mutations on
> 71             NaN                      protein-protein interactions
> 72        ModFOLD8           Quality estimates for 3D protein models
>
>                                                URL
> 0                    http://bib.fleming.gr/Arena3D
> 1          https://www.ccb.uni-saarland.de/aviator
> 2                    https://bio2byte.be/b2btools/
> 3                                              NaN
> 4                 https://benzdb.biocomp.unibo.it/
> ..                                             ...
> 68  https://www.ccb.uni-saarland.de/mirtargetlink2
> 69                                             NaN
> 70          http://biosig.unimelb.edu.au/mmcsm ppi
> 71                                             NaN
> 72       https://www.reading.ac.uk/bioinf/ModFOLD/
>
> [73 rows x 3 columns]]
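In this result, a cell that wraps onto a second line in the PDF comes back as an extra row whose ShortTitle is NaN. One possible clean-up, sketched below, folds each such row into the row before it. The helper names are made up for this example, and it works on plain lists of rows so it can be tried without pandas. The sample rows are copied from the output above.

```python
import math


def is_missing(value):
    """True for None or float NaN, which is how pandas marks empty cells."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def merge_continuation_rows(rows, key_index=0):
    """Fold every row with an empty key cell into the row before it."""
    merged = []
    for row in rows:
        if merged and is_missing(row[key_index]):
            previous = merged[-1]
            for i, cell in enumerate(row):
                if is_missing(cell):
                    continue  # nothing to append for this column
                if is_missing(previous[i]):
                    previous[i] = cell
                else:
                    previous[i] = f"{previous[i]} {cell}"
        else:
            merged.append(list(row))  # copy so the input is not mutated
    return merged


rows = [
    ["b2bTools", "Predictions for protein biophysical features and",
     "https://bio2byte.be/b2btools/"],
    [float("nan"), "their conservation", float("nan")],
    ["BENZ WS", "Four-level Enzyme Commission (EC) number",
     "https://benzdb.biocomp.unibo.it/"],
]
merged = merge_continuation_rows(rows)
for row in merged:
    print(row)
```

Assuming `page2` is the one-element list of DataFrames shown above, the input rows could be obtained with `page2[0].values.tolist()`.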
Hope this helps.