How do I write code to extract specific text and an integer from the same line of a PDF file using Python?

Time: 2018-11-09 18:32:46

Tags: python python-3.x

Below is the data I have in a PDF file. I want to extract the integer 100 from the line "US stock price 100", using "US stock price" as the keyword. How can I do this in Python?

**** PDF file lines below ****

sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur
US stock price 100
"Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit,
Abb price 50

Below is the code I use for text extraction:

import PyPDF2
pdfFileObject = open(path, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    Text=page.extractText()
    print(Text)

3 answers:

Answer 0: (score: 0)

The code below searches a PDF file for a keyword.

import PyPDF2
import re

# Open the PDF and count its pages
pdf_reader = PyPDF2.PdfFileReader("test.pdf")
num_pages = pdf_reader.getNumPages()
keyword = "US stock price"
for i in range(num_pages):
    page_obj = pdf_reader.getPage(i)
    print("this is page " + str(i))
    txt = page_obj.extractText()
    # re.search returns a match object if the keyword is on this page, otherwise None
    res_search = re.search(keyword, txt)
    print(res_search)
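
The search above only reports whether the keyword occurs on a page. As a minimal sketch of also pulling out the integer that follows the keyword, a capture group can be added to the pattern. The helper name `extract_int_after` and the `sample` string are assumptions made for illustration; `sample` stands in for the text returned by `extractText()`.

```python
import re

def extract_int_after(keyword, text):
    """Return the first integer that follows `keyword` in `text`, or None."""
    # re.escape keeps any special characters in the keyword literal;
    # \s+ tolerates one or more whitespace characters before the number.
    pattern = re.escape(keyword) + r"\s+(\d+)"
    match = re.search(pattern, text, re.IGNORECASE)
    return int(match.group(1)) if match else None

# Hypothetical page text standing in for the output of extractText()
sample = "nulla pariatur\nUS stock price 100\nadipisci velit,\nAbb price 50"

print(extract_int_after("US stock price", sample))  # 100
print(extract_int_after("Abb price", sample))       # 50
```

Returning `None` when the keyword is absent lets the caller skip pages that do not contain it.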

Answer 1: (score: 0)

You can try the tika package.


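Answer 2: (score: 0)

I see that you are using PyPDF2, so I have provided an example for that module. I have also provided an example that uses the tika module. I decided to use a regular expression to extract the requested text.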

# Example using PyPDF2
import re as regex
import PyPDF2

pdfFileObject = open('test.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
  page = pdfReader.getPage(i)
  text = page.extractText()

  # joining lines, because PyPDF2
  # output isn't formatted correctly
  pdf_text = ''.join(text.splitlines())

  find_stock_price = regex.findall(r'us stock price\s{2,}\d{2,4}\s', pdf_text, regex.IGNORECASE)
  if find_stock_price:
    # attempt to clean the output
    reformat_price = [regex.sub(r'\s\s+' , ' ', str(x).strip()) for x in find_stock_price]
    print(reformat_price)
    # output:
    # ['US stock price 100']

# Example using tika
import re as regex
from tika import parser

parsedPDF = parser.from_file("test.pdf")
pdf = parsedPDF["content"]
pdf = pdf.replace('\n\n', '\n')

# joining lines, because tika
# output isn't formatted correctly
pdf_text = ''.join(pdf.splitlines())

find_stock_price = regex.findall(r'us stock price\s{2,}\d{2,4}\s', pdf_text, regex.IGNORECASE)
if find_stock_price:
  # attempt to clean the output
  reformat_price = [regex.sub(r'\s\s+' , ' ', str(x).strip()) for x in find_stock_price]
  print(reformat_price)
  # output:
  # ['US stock price 100']
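
Both examples return the matched text as a string. If the goal is the integer itself, and the extracted text happens to keep its line breaks (PDF extractors do not always guarantee this), a line-anchored pattern can collect every "... price <number>" line at once. This is only a sketch under those assumptions; `PRICE_LINE`, `prices_by_label`, and `sample` are hypothetical names that do not come from PyPDF2 or tika.

```python
import re

# A label ending in "price", then spaces or tabs, then an integer,
# all on a single line (re.MULTILINE makes ^ and $ match per line).
PRICE_LINE = re.compile(
    r"^[ \t]*(.*?\bprice)[ \t]+(\d+)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

def prices_by_label(text):
    """Map each '<label> price' line to the integer found on that line."""
    return {label: int(value) for label, value in PRICE_LINE.findall(text)}

# Hypothetical extracted text standing in for the PDF content
sample = "nulla pariatur\nUS stock price 100\nadipisci velit,\nAbb price 50"

print(prices_by_label(sample))
# {'US stock price': 100, 'Abb price': 50}
```

Converting with `int()` at extraction time means the caller can compare or sum the prices directly instead of parsing strings again.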