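I have a list of many URLs in Python. I looped over it and downloaded everything to a folder on my desktop. So far, each PDF is named like this: document0, document1, ..., documentx
What I want to do is extract keywords from each PDF file, but so far I haven't been able to figure out how to do it.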
export default gql`
  extend type Query {
    articles: [Article!]
    search(query: String!, type: SearchType): SearchResult
  }
  union SearchResult = Article | User
  enum SearchType {
    ARTICLE
    USER
  }
  type Article {
    id: ID!
    slug: String!
    title: String!
    description: String!
    text: String!
  }
  type User {
    id: ID!
    email: String!
    name: String!
  }
`;
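Answer 0 (score: 0)
A quick way to do shell-style filename matching is the glob module. Below, I've rewritten your code so that it returns a generator of matches from a PDF file. We then add up the number of such matches across all documents.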
import re
from glob import glob

# PyPDF2 >= 2.0 names; older releases call these PdfFileReader and extractText()
from PyPDF2 import PdfReader


def search_page(pattern, page):
    # extract_text() can come back empty for image-only pages
    yield from pattern.findall(page.extract_text() or "")


def search_document(pattern, path):
    document = PdfReader(path)
    for page in document.pages:
        yield from search_page(pattern, page)


pattern = re.compile(r'USD')  # Or r'\bUSD\b' if you don't want to match words containing USD
count = 0
for path in glob('//DOCUMENTS/document*.pdf'):
    matches = search_document(pattern, path)
    # The generator is lazy, so consume it to count the matches
    count += sum(1 for _ in matches)

print(f"Total count is {count}")  # "Total count is {}".format(count)
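If you need counts for several keywords per document rather than one grand total, the same idea extends naturally. The following is only a minimal sketch: `count_keywords`, the `pages` dictionary, and its sample strings are hypothetical stand-ins for the text you would get from each PDF, and whole-word, case-insensitive matching is an assumption about what you mean by a "keyword".

```python
import re
from collections import Counter


def count_keywords(text, keywords):
    """Count whole-word, case-insensitive occurrences of each keyword in text."""
    counts = Counter()
    for keyword in keywords:
        # re.escape keeps keywords containing regex metacharacters literal
        pattern = re.compile(r'\b' + re.escape(keyword) + r'\b', re.IGNORECASE)
        counts[keyword] = len(pattern.findall(text))
    return counts


# Hypothetical text standing in for what would be extracted from each PDF
pages = {
    'document0': 'Price in USD. The USD rate rose; EUR fell.',
    'document1': 'No dollars here, only EUR and more EUR.',
}
keywords = ['USD', 'EUR']

# One Counter per document, then a combined Counter across all documents
per_document = {name: count_keywords(text, keywords) for name, text in pages.items()}
total = sum(per_document.values(), Counter())

print(per_document)
print(total)
```

In real use you would build `pages` by joining the extracted text of every page in each file returned by `glob`, instead of hard-coding strings.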