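How to extract keywords from a list of PDFs

Time: 2019-08-20 17:03:03

Tags: python python-3.x pdf

I have a Python list containing many URLs. I looped over it and downloaded everything into a folder on my desktop. So far, each PDF is named like this: document0, document1, ..., documentx

What I want to do is extract keywords from each PDF file, but so far I have not been able to figure out how to do it.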

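1 answer:

Answer 0: (score: 0)

A quick way to do shell-style filename matching is the glob module. Below, I rewrote your code so that it returns a generator of matches from a PDF file. We then add up the number of such matches across all documents.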

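from glob import glob
import re
from PyPDF2 import PdfReader  # PyPDF2 >= 3.0 removed PdfFileReader and extractText

def search_page(pattern, page):
    # Yield every match of the pattern in the text of one page
    yield from pattern.findall(page.extract_text())

def search_document(pattern, path):
    # Yield matches from every page of the PDF at the given path
    document = PdfReader(path)
    for page in document.pages:
        yield from search_page(pattern, page)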

pattern = re.compile(r'USD')  # Or r'\bUSD\b' if you don't want to match words containing USD

count = 0

for path in glob('//DOCUMENTS/document*.pdf'):
    matches = search_document(pattern, path)
    count += sum(1 for _ in matches)

print(f"Total count is {count}")  # "Total count is {}".format(count)
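
To see what the word-boundary variant of the pattern changes, here is a minimal sketch that runs the same counting logic on plain strings instead of PDF pages. The `sample_docs` dictionary and the `count_matches` helper are hypothetical, introduced only for illustration; the strings stand in for the text that would be extracted from each page.

```python
import re

def count_matches(pattern, pages):
    """Count regex matches across an iterable of page texts."""
    return sum(len(pattern.findall(text)) for text in pages)

# Hypothetical page texts standing in for extracted PDF page text.
sample_docs = {
    "document0": ["Price: 10 USD", "Paid in USDT and USD"],
    "document1": ["No currency mentioned here"],
}

loose = re.compile(r"USD")       # also matches inside longer words such as USDT
strict = re.compile(r"\bUSD\b")  # matches USD only as a whole word

# Per-document counts for each pattern.
per_doc_loose = {name: count_matches(loose, pages) for name, pages in sample_docs.items()}
per_doc_strict = {name: count_matches(strict, pages) for name, pages in sample_docs.items()}

print(per_doc_loose)
print(per_doc_strict)
```

Keeping a count per document rather than a single running total makes it easier to see which files actually contain the keyword.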