I'm currently trying to search multiple PDFs for certain pieces of equipment. I've figured out how to parse the PDF files and the equipment list in Python, but I'm stuck on the actual search step. The best approach I've found online is to tokenize the text and search by keyword (code below). Unfortunately, some of the equipment names are several words long, so those names get tokenized into common words like "Blue" and "Evaporation" that occur many times in the text, which saturates the results with false positives. The only workaround I've come up with is to search only on the words that are unique to each equipment name and drop the more common ones, but I'm wondering whether there is a more elegant solution, since even the unique words tend to produce multiple false hits per document.
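For example, with a made-up device name and sentence, every piece of the name matches even though the device itself is never mentioned:

from nltk import word_tokenize

# hypothetical multi-word device name, split into common words
name_tokens = word_tokenize("Blue Evaporation Chamber A")
# -> ['Blue', 'Evaporation', 'Chamber', 'A']

sample = "The blue coating showed evaporation inside the main chamber."
for tok in name_tokens:
    if tok.lower() in sample.lower():
        print(tok)  # all four tokens hit, yet the device never appears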
Mainly, I'm looking for a way to search a text file for a phrase such as "Blue Transmitter 3" without parsing that phrase into ["Blue", "Transmitter", "3"].
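Roughly, the behavior I'm after is a whole-phrase lookup, something like this minimal sketch (the phrase and text here are made up; in practice text would be the extracted PDF text):

import re

text = "...the Blue Transmitter 3 was moved to bay 4..."  # stand-in for the PDF text
phrase = "Blue Transmitter 3"
# whole-phrase match, case-insensitive, with word boundaries so
# "Transmitter 3" doesn't also hit "Transmitter 30"
if re.search(r'\b' + re.escape(phrase) + r'\b', text, re.IGNORECASE):
    print("found " + phrase)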
Here is what I have so far:
import PyPDF2
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
import re
#open up pdf and get text
pdfName = 'example.pdf'
read_pdf = PyPDF2.PdfFileReader(pdfName)
text = ""
for i in range(read_pdf.getNumPages()):
    page = read_pdf.getPage(i)
    text += "Page No - " + str(i + 1) + "\n"  # i is already the zero-based page index
    page_content = page.extractText()
    text += page_content + "\n"
#tokenize pdf text
tokens = word_tokenize(text)
punctuations = ['(',')',';',':','[',']',',','.']
stop_words = stopwords.words('english')
# keywords = every PDF token that is neither a stop word nor punctuation
keywords = [word for word in tokens if word not in stop_words and word not in punctuations]
#take out the endline symbol and join the whole equipment data set into one long string
with open('equipment.txt') as f:
    lines = [line.rstrip('\n') for line in f]
totalEquip = " ".join(lines)
tokens = word_tokenize(totalEquip)
trash = ['Black', 'furnace', 'Evaporation', 'Evaporator', '500', 'Chamber', 'A']
searchWords = [word for word in tokens if word not in stop_words and word not in punctuations and word not in trash]
# check every equipment word against every keyword pulled from the PDF
for i in searchWords:
    for word in keywords:  # was 'splitKeys', which is never defined above
        if i.lower() in word.lower():
            print(i)
            print(word + "\n")
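The only other idea I've run across is NLTK's MWETokenizer, which re-merges known multi-word expressions into single tokens after word_tokenize, but I'm not sure whether that's the cleanest route here (sketch below, with a made-up device name):

from nltk import word_tokenize
from nltk.tokenize import MWETokenizer

# each known multi-word name becomes one token again after tokenization
mwe = MWETokenizer([('Blue', 'Transmitter', '3')], separator=' ')
tokens = mwe.tokenize(word_tokenize("The Blue Transmitter 3 sits in lab 2."))
print(tokens)  # ['The', 'Blue Transmitter 3', 'sits', 'in', 'lab', '2', '.']
print('Blue Transmitter 3' in tokens)  # True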
Any help or ideas would be greatly appreciated.