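I am new to this forum, so apologies if this is a long question.
I am trying to create a generic keyword parser that takes a list of keywords and a list of text lines (which may have been generated from a DB or from a free-form text file). I then want to extract entities from the list of text lines based on the keyword list, so that I can produce three key outputs;
Below is a sample of the Python code I have written for this. As you can see, I am trying to do this in three stages;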
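Stage 1 - Accept a reject pattern so that I can remove all known unwanted lines from the list of text lines
Stage 2 (Pass 1 in the code) - Do an index-style lookup on the keywords to reduce the list of lines that need the full loop search
Stage 3 (Pass 2 in the code) - Do the full loop search.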
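Problem: Stage 3 (Pass 2 in the code) is extremely inefficient. For example, with a keyword list of 4,500 elements and a list of nearly 2 million text lines, the code runs for more than 24 hours.
Can anyone suggest a better way to do Pass 2? Or is there a better way to write the whole function?
I am a beginner in Python, so apologies in advance if I have missed something obvious.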
##########################################################################################
# The keyword parser conducts a 2-pass keyword lookup and parsing.
# Inputs:
# keywordIDsList - A list of the IDs of the keywords (Standard declaration: keywordIDsList[] = hash value of the keywords)
# keywordDict - The dict of all the keywords and their associated IDs.
# (Standard declaration: keywordDict[keywordID] = (keywordID, keyWord) where keywordID is the hash value in keywordIDsList)
# valueIDsList - A list of the IDs of all the values that need to be parsed (Standard declaration: valueIDsList[] = unique reference number of the values)
# valuesDict - The dict of all the value lines and their associated IDs.
# (Standard declaration: valuesDict[uniqueValueKey] = (uniqueValueKey, valueText) where uniqueValueKey is the unique key in valueIDsList)
# rejectPattern - A regular-expression pattern for rejecting lines that match certain patterns. This is an optional field.
# Outputs:
# parsedHashIDsList - A list of the hash values generated for every successful parse result
# parsedResultsDict - The actual parsed values, as parsedResultsDict[parsedHashID] = (uniqueValueKey, keywordID, frequencyResult)
# successResultIDsList - A list of all unique value references that were parsed successfully
# rejectResultIDsList - A list of all unique value references that were rejected
##########################################################################################
import re

def keywordParser(keywordIDsList, keywordDict, valueIDsList, valuesDict, rejectPattern):
    parsedResultsDict = {}
    parsedHashIDsList = []
    successResultIDsList = []
    rejectResultIDsList = []
    processListPass1 = []
    processListPass2 = []
    # Index the keywords by their text so that Pass 1 can do exact-match lookups
    idxkeyWordDict = {}
    for keyID in keywordIDsList:
        keywordID, keyWord = keywordDict[keyID]
        idxkeyWordDict[keyWord] = (keywordID, keyWord)
    percCount = 1
    # optional: if rejectPattern is provided then reject lines
    # ## Some python code for processing the reject patterns - this works fine
    # Pass 1: Index based matching - partial code for index based search
    for valueID in processListPass1:
        valKey, valText = valuesDict[valueID]
        try:
            keywordID, keyWordVal = idxkeyWordDict[valText]
        except KeyError:
            # No exact match: defer this line to the full search in Pass 2
            processListPass2.append(valueID)
    percCount = 0
    # Pass 2: Text based search and lookup - this part of the code is extremely inefficient
    for valueID in processListPass2:
        percCount += 1
        valKey, valText = valuesDict[valueID]
        valSuccess = 'N'
        # Every remaining line is searched once per keyword
        for keyID in keywordIDsList:
            keywordID, keyWordVal = keywordDict[keyID]
            keySearch = re.findall(keyWordVal, valText, re.DOTALL)
            if keySearch:
                parsedHashID = hash(str(valueID) + str(keyID))
                parsedHashIDsList.append(parsedHashID)
                parsedResultsDict[parsedHashID] = (valueID, keywordID, len(keySearch))
                valSuccess = 'Y'
        if valSuccess == 'Y':
            successResultIDsList.append(valueID)
        else:
            rejectResultIDsList.append(valueID)
    return (parsedResultsDict, parsedHashIDsList, successResultIDsList, rejectResultIDsList)
Answer 0 (score: 1)
This is a perfect use case for the Aho-Corasick string matching algorithm. A similar use case is explained in this blog post, with code examples in Python.
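To give a feel for the idea, here is a minimal pure-Python sketch. It is not the code from the linked post, and the names build_automaton and count_keywords are made up for this example. It assumes the keywords are literal strings rather than regular expressions (the question passes them to re.findall, so behavior differs if a keyword contains regex metacharacters), and it counts overlapping occurrences, which re.findall does not.

```python
from collections import Counter, deque


def build_automaton(keywords):
    """Build an Aho-Corasick automaton from {keyword_id: keyword_text}.

    Hypothetical helper for illustration; keywords are treated as literals.
    """
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # fail[state] -> state to fall back to on a mismatch
    output = [[]]  # output[state] -> IDs of keywords that end at this state

    # Phase 1: insert every keyword into a trie.
    for keyword_id, word in keywords.items():
        if not word:
            continue  # an empty keyword would match everywhere, so skip it
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(keyword_id)

    # Phase 2: breadth-first pass that sets the failure links.
    # Depth-1 states keep the default failure link to the root (state 0).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and ch not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(ch, 0)
            # Inherit keywords that end at the suffix this state falls back to.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def count_keywords(text, automaton):
    """Scan text once and return a Counter of keyword_id -> occurrences."""
    goto, fail, output = automaton
    counts = Counter()
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for keyword_id in output[state]:
            counts[keyword_id] += 1
    return counts


# Toy data standing in for keywordDict and one text line from valuesDict.
automaton = build_automaton({1: "he", 2: "she", 3: "hers"})
matches = count_keywords("ushers", automaton)  # each keyword is found once
```

The automaton is built once for the whole keyword list, and each text line is then scanned in a single pass, so the work per line depends on the length of the line and the number of matches rather than on running one regex search per keyword. If the keywords really are regular expressions, this sketch does not apply as written.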