Below is the data I have in a PDF file. I want to use the keyword "US stock price" to extract the integer 100 from the line "US stock price 100".

sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem.
Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur?
Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur
US stock price 100
"Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium,
totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo.
Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt.
Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit,
Abb price 50
How can I do this in Python?
****The lines above are from the PDF file*****
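import PyPDF2

path = "test.pdf"  # assumed file name; the original snippet leaves `path` undefined

# Print the extracted text of every page
# (PyPDF2 >= 3.0 names: PdfReader, reader.pages, extract_text)
with open(path, "rb") as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    for page in reader.pages:
        text = page.extract_text()
        print(text)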
The code above is what I use for text extraction.
Answer 0 (score: 0)
Below is code that searches for a keyword in a PDF file.
import re

import PyPDF2

# Named `reader` and `keyword` to avoid shadowing the built-in `object`
reader = PyPDF2.PdfReader("test.pdf")
keyword = "US stock price"

for i, page in enumerate(reader.pages):
    print("this is page " + str(i))
    txt = page.extract_text()
    # re.search returns a match object, or None if the keyword is absent
    res_search = re.search(keyword, txt)
    print(res_search)
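As written, the loop prints a match object (or `None`) rather than the line itself. A minimal sketch of returning the lines that contain the keyword, assuming the page text has already been extracted into a string; the helper name `find_keyword_lines` and the `sample_text` value are hypothetical and only stand in for real extraction output:

```python
def find_keyword_lines(text, keyword):
    """Return every line of `text` that contains `keyword` (case-insensitive)."""
    needle = keyword.lower()
    return [line.strip() for line in text.splitlines() if needle in line.lower()]


# Hypothetical stand-in for the text returned by page extraction
sample_text = "lorem ipsum\nUS stock price 100\ndolor sit amet\nAbb price 50\n"

print(find_keyword_lines(sample_text, "US stock price"))
```

This only works if the extractor keeps the keyword and its value on the same line, which is not guaranteed for every PDF.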
Answer 1 (score: 0)
You can try the tika package.
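Answer 2 (score: 0)
I see that you are using PyPDF2, so I have provided an example for that module. I have also provided an example that uses the tika module. I decided to use a regular expression to extract the requested text.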
# Example using PyPDF2
import re as regex

import PyPDF2

with open('test.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    for page in reader.pages:
        text = page.extract_text()

        # joining lines, because PyPDF2
        # output isn't formatted correctly
        pdf_text = ''.join(text.splitlines())

        find_stock_price = regex.findall(r'us stock price\s{2,}\d{2,4}\s', pdf_text, regex.IGNORECASE)
        if find_stock_price:
            # attempt to clean the output: collapse runs of whitespace into one space
            reformat_price = [regex.sub(r'\s\s+', ' ', str(x).strip()) for x in find_stock_price]
            print(reformat_price)

# output
['US stock price 100']
# Example using tika
import re as regex

from tika import parser

parsed_pdf = parser.from_file("test.pdf")
pdf = parsed_pdf["content"]
pdf = pdf.replace('\n\n', '\n')

# joining lines, because tika
# output isn't formatted correctly
pdf_text = ''.join(pdf.splitlines())

find_stock_price = regex.findall(r'us stock price\s{2,}\d{2,4}\s', pdf_text, regex.IGNORECASE)
if find_stock_price:
    # attempt to clean the output: collapse runs of whitespace into one space
    reformat_price = [regex.sub(r'\s\s+', ' ', str(x).strip()) for x in find_stock_price]
    print(reformat_price)

# output
['US stock price 100']
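Both examples print the whole matched line, while the question asks for the integer. A minimal sketch of pulling out the number with a capture group, assuming the extracted text keeps the keyword and the number on one line separated by spaces or tabs; the helper name `extract_int_after` and the sample string are hypothetical and are not part of PyPDF2 or tika:

```python
import re


def extract_int_after(keyword, text):
    """Return the first integer that follows `keyword` in `text`, or None."""
    # re.escape guards against regex metacharacters in the keyword;
    # [ \t]+ allows one or more spaces/tabs between keyword and number
    pattern = re.escape(keyword) + r"[ \t]+(\d+)\b"
    match = re.search(pattern, text, re.IGNORECASE)
    return int(match.group(1)) if match else None


# Hypothetical stand-in for text extracted from the PDF
sample_text = "... nulla pariatur\nUS stock price 100\nAbb price 50\n"

print(extract_int_after("US stock price", sample_text))  # 100
print(extract_int_after("Abb price", sample_text))       # 50
```

If the extractor inserts extra line breaks or drops the space before the number, the pattern would need to be loosened to match that output.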