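I am trying to extract text from a PDF. This has been discussed many times on SO, but I still cannot extract the text with the spaces between words preserved.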
$python3
Python 3.5.2 (default, Sep 14 2016, 11:28:32)
[GCC 6.2.1 20160901 (Red Hat 6.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import PyPDF2
>>> pdfFileObj = open('/var/tmp/acs%2Eaccounts%2E6b00452.pdf','rb')
>>> pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
>>> pageObj = pdfReader.getPage(0)
>>> pageObj.extractText()
This yields:
'TowardtheRationalDesignofNovelNoncentrosymmetricMaterials:\nFactorsIn\nuencingtheFrameworkStructures\nKangMinOk\n*DepartmentofChemistry,Chung-AngUniversity,84Heukseok-ro,Dongjak-gu,Seoul06974,RepublicofKorea\nCONSPECTUS:Solid-statematerialswithextendedstructureshaverevealed\nmanyinterestingstructure-relatedch\naracteristics.Amongmany,materials\ncrystallizinginnoncentrosymmetric(NCS)spacegroupshaveattractedmassive\n\nattentionattributabletoavarietyofsuperbfunctionalpropertiessu
However, if I run pdf2txt.py directly in the terminal:
$ pdf2txt.py '/var/tmp/acs%2Eaccounts%2E6b00452.pdf' | more
I get this output:
Article
pubs.acs.org/accounts
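Toward the Rational Design of Novel Noncentrosymmetric Materials: Factors Influencing the Framework Structures
Kang Min Ok*
Department of Chemistry, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea
CONSPECTUS: Solid-state materials with extended structures have revealed many interesting structure-related characteristics. Among many, materials crystallizing in noncentrosymmetric (NCS) space groups have attracted massive attention attributable to a variety of superb functional properties su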
This is the desired output.
I can't figure out what I am doing wrong in my Python script. Please help.
Answer 0 (score: 1)
I ran into the same problem and solved it by looking more closely at the pdf2txt.py script. I bet your pdf2txt.py comes from PDFMiner (pdfminer.six for Python 3). You should set the layout parameters through pdfminer.layout.LAParams(), the way pdf2txt.py does:
if not no_laparams:
    laparams = pdfminer.layout.LAParams()
    for param in ("all_texts", "detect_vertical", "word_margin", "char_margin", "line_margin", "boxes_flow"):
        paramv = locals().get(param, None)
        if paramv is not None:
            setattr(laparams, param, paramv)
else:
    laparams = None
To learn more about the parameters, take a look at this post.
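If you would rather do this from your own Python script than through pdf2txt.py, something along these lines might work. This is only a minimal sketch: it assumes pdfminer.six is installed and recent enough to provide pdfminer.high_level.extract_text, and the helper names collect_layout_options and extract_text_with_layout are made up for illustration. The margin values in the usage line are just starting points to tune for your PDF.

```python
import sys

# The same layout options that the pdf2txt.py excerpt above copies onto LAParams.
LAYOUT_OPTION_NAMES = (
    "all_texts", "detect_vertical", "word_margin",
    "char_margin", "line_margin", "boxes_flow",
)


def collect_layout_options(**options):
    """Hypothetical helper: keep only the layout options that were set (not None)."""
    unknown = set(options) - set(LAYOUT_OPTION_NAMES)
    if unknown:
        raise ValueError("unknown layout option(s): " + ", ".join(sorted(unknown)))
    return {name: value for name, value in options.items() if value is not None}


def extract_text_with_layout(pdf_path, **options):
    """Hypothetical helper: extract text with layout analysis enabled."""
    # Imported here so the helper above can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    # Options left as None fall back to whatever LAParams uses by default.
    laparams = LAParams(**collect_layout_options(**options))
    return extract_text(pdf_path, laparams=laparams)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python extract.py /path/to/file.pdf
    print(extract_text_with_layout(sys.argv[1], word_margin=0.1, char_margin=2.0))
```

Passing an LAParams object is what turns on layout analysis, which is where the spaces between words and the line grouping come from. If words still run together, word_margin and char_margin are probably the first things to adjust.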