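Extracting text from a PDF with pdfminer while preserving spaces

Time: 2016-11-17 21:08:38

Tags: python-3.x pdfminer pypdf2

I am trying to extract text from a PDF. This has been discussed many times on SO, but I still cannot extract the text with the spaces between words preserved.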

$python3
Python 3.5.2 (default, Sep 14 2016, 11:28:32) 
[GCC 6.2.1 20160901 (Red Hat 6.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import PyPDF2
>>> pdfFileObj = open('/var/tmp/acs%2Eaccounts%2E6b00452.pdf','rb')
>>> pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
>>> pageObj = pdfReader.getPage(0)
>>> pageObj.extractText()

This yields:

'TowardtheRationalDesignofNovelNoncentrosymmetricMaterials:\nFactorsIn\nuencingtheFrameworkStructures\nKangMinOk\n*DepartmentofChemistry,Chung-AngUniversity,84Heukseok-ro,Dongjak-gu,Seoul06974,RepublicofKorea\nCONSPECTUS:Solid-statematerialswithextendedstructureshaverevealed\nmanyinterestingstructure-relatedch\naracteristics.Amongmany,materials\ncrystallizinginnoncentrosymmetric(NCS)spacegroupshaveattractedmassive\n\nattentionattributabletoavarietyofsuperbfunctionalpropertiessu

However, if I run pdf2txt.py directly in the terminal:

$pdf2txt.py '/var/tmp/acs%2Eaccounts%2E6b00452.pdf'| more

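I get this output:

pubs.acs.org/accounts

Toward the Rational Design of Novel Noncentrosymmetric Materials: Factors Influencing the Framework Structures

Kang Min Ok *

Department of Chemistry, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Republic of Korea

CONSPECTUS: Solid-state materials with extended structures have revealed many interesting structure-related characteristics. Among many, materials crystallizing in noncentrosymmetric (NCS) space groups have attracted massive attention attributable to a variety of superb functional properties su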

This is the desired output.

I cannot figure out what I am doing wrong in my Python script. Please help.

1 answer:

Answer 0 (score: 1)

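I ran into the same problem and solved it by looking more closely at the pdf2txt.py script.

I assume your pdf2txt.py comes from pdfminer (pdfminer.six for Python 3).

The spacing comes from pdfminer's layout analysis, which is configured through pdfminer.layout.LAParams(). This is how pdf2txt.py sets those parameters: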

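# Excerpt from pdf2txt.py: copy every layout option that was set into LAParams.
if not no_laparams:
    laparams = pdfminer.layout.LAParams()
    for param in ("all_texts", "detect_vertical", "word_margin", "char_margin", "line_margin", "boxes_flow"):
        # Options left unset are None and keep the LAParams default.
        paramv = locals().get(param, None)
        if paramv is not None:
            setattr(laparams, param, paramv)
else:
    # No LAParams object means layout analysis is skipped.
    laparams = None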
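To do the same from your own script instead of the command line, something like the following should work. It is only a sketch: it assumes a recent pdfminer.six release that provides pdfminer.high_level.extract_text, and the function name extract_with_spaces is my own, not part of pdf2txt.py. The values shown are, as far as I know, the LAParams defaults.

```python
def extract_with_spaces(pdf_path, word_margin=0.1, char_margin=2.0, line_margin=0.5):
    """Return the text of a PDF, using layout analysis to separate words.

    Hypothetical helper for illustration; assumes pdfminer.six is installed.
    """
    # Imported here so the sketch can be loaded even without pdfminer.six.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    laparams = LAParams(
        word_margin=word_margin,   # gap (relative to glyph size) that starts a new word
        char_margin=char_margin,   # gap below which characters belong to the same line
        line_margin=line_margin,   # gap below which lines belong to the same paragraph
    )
    return extract_text(pdf_path, laparams=laparams)


if __name__ == "__main__":
    import sys

    # Usage: python extract.py /path/to/file.pdf
    if len(sys.argv) > 1:
        print(extract_with_spaces(sys.argv[1]))
```

If words still run together, try a smaller word_margin so that narrower gaps are treated as spaces.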

To learn more about the parameters, take a look at this post.
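As a rough intuition for word_margin, here is a toy model of the rule as I understand it: a space is inserted when the horizontal gap between two neighbouring characters is larger than word_margin times the character's size. This is a simplified sketch, not pdfminer's actual code; the Glyph tuple and join_glyphs function are made up for illustration.

```python
from typing import List, Tuple

# (character, x0, x1, height): a hypothetical, simplified positioned glyph.
Glyph = Tuple[str, float, float, float]


def join_glyphs(glyphs: List[Glyph], word_margin: float = 0.1) -> str:
    """Join glyphs left to right, adding a space where the gap is wide enough."""
    pieces = []
    prev_x1 = None
    for char, x0, x1, height in glyphs:
        if prev_x1 is not None and word_margin:
            # The threshold scales with the size of the current glyph.
            margin = word_margin * max(x1 - x0, height)
            if prev_x1 < x0 - margin:
                pieces.append(" ")
        pieces.append(char)
        prev_x1 = x1
    return "".join(pieces)


# Glyphs are 5 units wide and 10 tall; the two words are 3 units apart.
glyphs = [
    ("K", 0, 5, 10), ("a", 5, 10, 10), ("n", 10, 15, 10), ("g", 15, 20, 10),
    ("M", 23, 28, 10), ("i", 28, 33, 10), ("n", 33, 38, 10),
]

print(join_glyphs(glyphs, word_margin=0.1))  # Kang Min
print(join_glyphs(glyphs, word_margin=0.5))  # KangMin
```

With word_margin=0.1 the threshold is 1 unit, so the 3-unit gap becomes a space. With word_margin=0.5 the threshold is 5 units and the words are joined, which is the kind of output shown in the question.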