Question

我想使用Python和PYPDF包从pdf文件中提取文本。这是我的pdf fie，这是我的代码：

import PyPDF2
opened_pdf = PyPDF2.PdfFileReader('test.pdf', 'rb')

p=opened_pdf.getPage(0)

p_text= p.extractText()
# extract data line by line
P_lines=p_text.splitlines()
print P_lines

我的问题是P_lines无法逐行提取数据并导致一个巨大的字符串。我想逐行提取文本来分析它。有关如何改进它的任何建议？谢谢！这是代码返回的字符串：

[u＆＃39;符合29 CFR 1910.1200（i）的化学品的成分信息附录D来自供应商材料安全数据表（MSDS）**信息基于最大潜力浓度因此总量可能超过100％*总水量来源可包括淡水，产出水和/或再循环 water0.01271％72.00％7732-18-5Water0.00071％4.00％1310-73-2Sodium Hydroxide0.00424％24.00％533-74-4DazomatBiocidePumpcoPlexcide 24L0.00828％75.00％有机膦酸盐0.00276％25.00％67-56-1Methyl AlcoholScale InhibitorPumpcoPlexaid 6730.00807％30.00％7732-18-5Water0.00188％7.00％聚乙氧基醇表面活性剂0.00753％28.00％9003-06-9铵盐类0.00941％35.00％64742-47-8石油馏分摩擦 ReducerPumpcoPlexslick 9210.05029％60.00％7732-18-5Water0.03353％40.00％7647-01-0氯化氢盐酸PumpcoHCL9.84261％100.00％14808-60-7Crystaline SilicaProppantPumpcoSand90.01799％100.00％7732-18-5WaterCommentsMaximumIngredientConcentrationin HF Fluid（质量百分比）** MaximumIngredientConcentrationin Additive（％by 质量）**化学抽象服务号码（CAS ＃）IngredientsPurposeSupplierTrade名称水力压裂液成分：2,608,032总水量（gal）*：7,595真正垂直深度（TVD）：GasProduction类型：NAD27Long / Lat 投影：32.558525Latitude：-97.215242Longitude：Ole Gieser Unit D. 6HWell名称和编号：XTO EnergyOperator名称：42-439-35084API 编号：TarrantCounty：TexasState：12/10/2010Fracture DateHydraulic 压裂液产品组件信息披露＆＃39;]

该文件的屏幕截图：

Answer 1

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text
print(convert_pdf_to_txt('test.pdf').strip().split('\n\n'))

输出

水力压裂液产品组件信息披露

断裂日期状态：县：API编号：运营商名称：井名和   数量：经度：纬度：长/纬投影：生产类型：   真实垂直深度（TVD）：总水量（gal）*：

12/10/2010 Texas Tarrant 42-439-35084 XTO Energy Ole Gieser Unit D 6H   -97.215242   32.558525 NAD27 Gas 7,595 2,608,032

水力压裂液组成：

商号

供应商

目的

成分</ P>
化学文摘服务编号

（CAS＃）

最大成分

浓度
     添加剂中的
（按质量计）**

评论

最大成分

浓度
     HF流体中的
（按质量计）**

水砂HCL

Pumpco Pumpco

支撑剂盐酸

Plexslick 921

Pumpco

摩擦减速剂

Plexaid 673

Pumpco

阻垢剂

Plexcide 24L

Pumpco

杀生物剂

Crystaline Silica

氯化氢水

石油馏分铵盐聚乙氧基化醇   表面活性剂水

甲醇有机膦酸盐

Dazomat氢氧化钠水

7732-18-5 14808-60-7

7647-01-0 7732-18-5

64742-47-8 9003-06-9

7732-18-5

67-56-1

533-74-4 1310-73-2 7732-18-5

100.00   100.00

90.01799   9.84261

40.00   60.00

35.00   28.00   7.00   30.00

25.00   75.00

24.00   4.00   72.00

0.03353   0.05029

0.00941   0.00753   0.00188   0.00807

0.00276   0.00828

0.00424   0.00071   0.01271


总水量来源可包括淡水，采出水和/或循环水   **信息基于最大浓度潜力，因此总数可能超过100


符合29 CFR 1910.1200（i）的化学品的成分信息   附录D来自供应商材料安全数据表   （MSDS）

Answer 2

确保您导入的PDF实际上有新行。如果它没有，那么p_text.splitlines()无法分割字符串！如果有特定字符，您可以使用p_text.split("the linebreak character")。

编辑：根据您的PDF，我不确定是否有办法逐行拆分，因为它似乎是静态格式而不是线性格式。（文本按PDF中的位置放置，而不是逐行放置。）

Answer 3

这是我想出的功能，它完全基于@SmartManoj答案，但通过使用with语句已更新为更简洁的方法（在我看来），消除了不必要的变量（即，使用关键字参数的变量）自我解释），以及产生页面文字。

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def pages_as_txt(path) -> Generator[str, None, None]:
    rsrcmgr = PDFResourceManager()
    with StringIO() as retstr, TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=LAParams()) as device:
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        with open(path, 'rb') as fp:
            for page in PDFPage.get_pages(fp, check_extractable=False):
                interpreter.process_page(page)
                yield retstr.getvalue()
                retstr.truncate(0)
                retstr.seek(0)

Answer 4

textract在Python3中可以使用tesseract方法正常工作。示例代码：

import textract
text = textract.process("pdfs/testpdf1.pdf", method='tesseract')
print(text)
with open('textract-results.txt', 'w+') as f:
    f.write(str(text))

使用Python和Pypdf2从pdf中提取文本

4 个答案: