Question

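I'm trying to extract text from a PDF. This has been discussed many times on SO, but I still can't get the text out with the spaces between words preserved.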

$python3
Python 3.5.2 (default, Sep 14 2016, 11:28:32) 
[GCC 6.2.1 20160901 (Red Hat 6.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import PyPDF2
>>> pdfFileObj = open('/var/tmp/acs%2Eaccounts%2E6b00452.pdf','rb')
>>> pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
>>> pageObj = pdfReader.getPage(0)
>>> pageObj.extractText()

This yields:

'TowardtheRationalDesignofNovelNoncentrosymmetricMaterials:\nFactorsIn\nuencingtheFrameworkStructures\nKangMinOk\n*DepartmentofChemistry,Chung-AngUniversity,84Heukseok-ro,Dongjak-gu,Seoul06974,RepublicofKorea\nCONSPECTUS:Solid-statematerialswithextendedstructureshaverevealed\nmanyinterestingstructure-relatedch\naracteristics.Amongmany,materials\ncrystallizinginnoncentrosymmetric(NCS)spacegroupshaveattractedmassive\n\nattentionattributabletoavarietyofsuperbfunctionalpropertiessu

However, if I run pdf2txt.py directly in the terminal:

$pdf2txt.py '/var/tmp/acs%2Eaccounts%2E6b00452.pdf'| more

I get this output:

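Article

pubs.acs.org/accounts

Toward the Rational Design of Novel Noncentrosymmetric Materials:   Factors Influencing the Framework Structures

Kang Min Ok*

Department of Chemistry, Chung-Ang University, 84 Heukseok-ro,   Dongjak-gu, Seoul 06974, Republic of Korea

CONSPECTUS: Solid-state materials with extended structures have revealed   many interesting structure-related characteristics. Among   many, materials crystallizing in noncentrosymmetric (NCS) space groups have attracted massive   attention attributable to a variety of superb   functional properties su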

This is the desired output.

I can't figure out what I'm doing wrong in the Python script. Please help.

Answer 1

I ran into the same problem and solved it by looking more closely at the pdf2txt.py script.

pdf2txt.py most likely comes from PDFMiner (pdfminer.six for Python 3).

The spacing comes from the layout analysis parameters, which the script sets through pdfminer.layout.LAParams():

if not no_laparams:
    laparams = pdfminer.layout.LAParams()
    for param in ("all_texts", "detect_vertical", "word_margin", "char_margin", "line_margin", "boxes_flow"):
        paramv = locals().get(param, None)
        if paramv is not None:
            setattr(laparams, param, paramv)
else:
    laparams = None
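The same idea can be used from your own script. Below is a minimal sketch, assuming a pdfminer.six release that provides pdfminer.high_level.extract_text; the helper names build_laparams_kwargs and extract_text_with_layout are made up for illustration and are not part of pdfminer.

```python
# Layout options that the pdf2txt.py excerpt above forwards to LAParams.
ALLOWED_PARAMS = ("all_texts", "detect_vertical", "word_margin",
                  "char_margin", "line_margin", "boxes_flow")


def build_laparams_kwargs(**overrides):
    """Return only the layout options that were explicitly set (not None)."""
    unknown = set(overrides) - set(ALLOWED_PARAMS)
    if unknown:
        raise ValueError("unknown layout parameter(s): %s" % sorted(unknown))
    return {name: value for name, value in overrides.items()
            if value is not None}


def extract_text_with_layout(pdf_path, **overrides):
    """Extract text with layout analysis enabled so word spacing is kept."""
    # Imported here so the helper above works even without pdfminer installed.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    laparams = LAParams(**build_laparams_kwargs(**overrides))
    return extract_text(pdf_path, laparams=laparams)


# Example call (path taken from the question; the value is only a starting point):
# text = extract_text_with_layout('/var/tmp/acs%2Eaccounts%2E6b00452.pdf',
#                                 word_margin=0.1)
```

Options left as None fall back to the LAParams defaults, which mirrors the setattr loop in the excerpt.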

To learn more about the parameters, take a look at this post.

Extracting PDF text with pdfminer, preserving spaces
