Question

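I am still quite new to Python, so please let me know if this question has already been answered. I searched for a few hours without finding it, so I thought I would try asking here.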

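For my thesis, I have to extract keywords from PDF documents. So far this works: I wrote code that counts single words across a list of PDF documents. However, I also need to find groups of words, for example "enterprise risk management".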

How can I extend the code to count how often a combination of words occurs in the PDF documents?

Part of the code:

search_list = ['risk', 'management', 'ERM']

# Loop over the search words
for i in search_list:
    search_word_count = 0
    # Loop over every page of the PDF (page indices start at 0,
    # so range(1, ...) would skip the first page)
    for pageNum in range(pdfReader.numPages):
        pageObj = pdfReader.getPage(pageNum)
        text = pageObj.extractText()
        # Convert the text to lower case and split it into single words
        # (split() alone cannot find three-word phrases)
        search_text = text.lower().split()
        # Loop over the words on this page and count the matches
        for word in search_text:
            # Lower-case the search word too, otherwise 'ERM' never matches
            if i.lower() in word:
                search_word_count += 1
    print("The word {} was found {} times".format(i, search_word_count))

Also, does anyone know how to get around the fact that I split the text into single words?

Thank you very much!
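One possible approach, as a minimal sketch rather than a definitive answer: skip split() altogether, collapse the whitespace in the extracted text, and search for each phrase with a regular expression. The helper name count_phrases and the sample text are made up for illustration, and the sketch works on a plain string, so it assumes the page text has already been extracted.

```python
import re
from collections import Counter


def count_phrases(text, phrases):
    """Count case-insensitive, whole-word occurrences of each phrase in text."""
    # Collapse every run of whitespace (including line breaks left over
    # from PDF extraction) into a single space and lower-case the text.
    normalized = " ".join(text.lower().split())
    counts = Counter()
    for phrase in phrases:
        # Normalize the phrase the same way, so 'ERM' is searched as 'erm'.
        needle = " ".join(phrase.lower().split())
        # \b restricts matches to whole words: 'risk' will not match 'brisk'
        # or 'risks'. Drop the \b markers if substring matches are wanted.
        pattern = r"\b" + re.escape(needle) + r"\b"
        counts[phrase] = len(re.findall(pattern, normalized))
    return counts


# Hypothetical sample text standing in for the extracted PDF text.
sample = ("Enterprise Risk\nManagement (ERM) is a framework. "
          "Risk management matters; enterprise risk management too.")
result = count_phrases(sample, ["enterprise risk management", "risk", "ERM"])
for phrase, n in result.items():
    print("The phrase {} was found {} times".format(phrase, n))
```

To use this with the loop above, the text of all pages could be joined into one string first and passed to the function once. Counting page by page would miss a phrase that starts on one page and ends on the next.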

How do I count occurrences of a group of words in a PDF document?

0 answers: