Searching for list items within another, longer list in Python

Time: 2014-01-07 09:17:47

Tags: python regex list lookup

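I am new to this forum, so apologies if this is a long question.

I am trying to create a generic keyword parser that takes a list of keywords and a list of text lines (possibly generated from a DB or a free-form text file). I want to extract entities from the list of text lines based on the keyword list, so that I can produce three key outputs:

  1. The keyword mentioned
  2. The text line in which the keyword is mentioned
  3. The number of times the keyword is mentioned in that text line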
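Below is a sample of the Python code I have written for this. As you can see, I am trying to do this in three stages:

Stage 1 - Accept a reject pattern so that I can remove all known unwanted lines from the list of text lines.

Stage 2 (pass 1 of the parsing) - Do an index-type lookup on the keywords to reduce the list of lines that need a full loop search.

Stage 3 (pass 2 in the code) - Do the full loop search.

Problem: stage 3 (pass 2 in the code) is extremely inefficient. As an example, with a keyword list of 4,500 elements and nearly 2 million text lines, the code runs for more than 24 hours.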
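To put those figures in perspective, here is a rough back-of-the-envelope calculation. It assumes the worst case, in which every text line survives the reject step and the index lookup and so reaches the nested loop; the real number of regex calls will be somewhat lower.

```python
# Figures quoted in the question (both approximate).
num_keywords = 4_500
num_lines = 2_000_000
runtime_seconds = 24 * 60 * 60  # "more than 24 hours", so a lower bound

# Worst case: one re.findall call per (text line, keyword) pair.
regex_calls = num_keywords * num_lines

# Implied throughput if all of those calls fit into 24 hours.
calls_per_second = regex_calls / runtime_seconds

print(f"{regex_calls:,} re.findall calls")                # 9,000,000,000
print(f"about {calls_per_second:,.0f} calls per second")  # about 104,167
```

The cost grows with the product of the two list sizes, so halving the keyword list would only halve the runtime. A real improvement requires scanning each line once for all keywords together.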

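Can anyone suggest a better way to do pass 2? Or is there a better way to write the whole function?

I am a beginner in Python, so apologies in advance if I have missed something obvious.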

    ##########################################################################################
    # The keyWord parser conducts a 2 pass keyword lookup and parsing.
    # Inputs:
    #  keywordIDsList - Is a list of the IDs of the keyword (Standard declaration: keywordIDsList[]= Hash value of the keyWords)
    #  KeywordDict - is the Dict of all the keywords and the associated ID.
    #          (Standard declaration: keywordDict[keywordID]=(keywordID, keyWord) where keywordID is hash value in keywordIDsList)
    #  valueIDsList - Is a list of the IDs of all the values that need to be parsed (Standard declaration: valueIDsList[]= Unique reference number of the values)
    #  valuesDict - Is the Dict of all the value lines and the associated IDs.
    #          (Standard declaration: valuesDict[uniqueValueKey]=(uniqueValueKey, valueText) where uniqueValueKey is the unique key in valueIDsList)
    #  rejectPattern - A regular expression based pattern for rejecting columns with certain types of patterns. This is an optional field.
    # Outputs:
    #  parsedHashIDsList - Is the list of hash values generated for every successful parse result
    #  parsedResultsDict - Is actual parsed value as parsedResultsDict[parsedHashID]=(uniqueValueKey, keywordID, frequencyResult)
    #  successResultIDsList - list of all unique value references that were parsed successfully
    #  rejectResultIDsList - list of all unique value references that were rejected
    ##########################################################################################
    
    import re
    
    
    def keywordParser(keywordIDsList, keywordDict, valueIDsList, valuesDict, rejectPattern):
        parsedResultsDict = {}
        parsedHashIDsList = []
        successResultIDsList = []
        rejectResultIDsList = []
        processListPass1 = []
        processListPass2 = []
        idxkeyWordDict = {}
    
        for keyID in keywordIDsList:
            keywordID, keyWord = keywordDict[keyID]
            idxkeyWordDict[keyWord] = (keywordID, keyWord)
    
        percCount = 1
        #    Optional: if rejectPattern is provided then reject matching lines
        # ## Some python code for processing the reject patterns - this works fine
        # ## (omitted here; it is expected to fill processListPass1 with the surviving value IDs)
    
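        #    Pass 1: Index based matching - partial code for index based search
        for valueID in processListPass1:
            valKey, valText = valuesDict[valueID]
            try:
                # idxkeyWordDict values are (keywordID, keyWord), in that order
                keywordID, keyWordVal = idxkeyWordDict[valText]
            except KeyError:
                # No exact match: defer this line to the full search in pass 2
                processListPass2.append(valueID)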
    
        percCount = 0
    
        #   Pass 2: Text based search and lookup - this part of the code is extremely inefficient
    
        for valueID in processListPass2:
            percCount += 1
            valKey, valText = valuesDict[valueID]
            valSuccess = 'N'
            for keyID in keywordIDsList:
                # keywordDict values are (keywordID, keyWord), in that order
                keywordID, keyWordVal = keywordDict[keyID]
                keySearch = re.findall(keyWordVal, valText, re.DOTALL)
                if keySearch:
                    parsedHashID = hash(str(valueID) + str(keyID))
                    parsedHashIDsList.append(parsedHashID)
                    parsedResultsDict[parsedHashID] = (valueID, keywordID, len(keySearch))
                    valSuccess = 'Y'
            if valSuccess == 'Y':
                successResultIDsList.append(valueID)
            else:
                rejectResultIDsList.append(valueID)
    
        return (parsedResultsDict, parsedHashIDsList, successResultIDsList, rejectResultIDsList)
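
The header comment describes four input structures. As a minimal sketch of how they might be built from two plain lists, the following uses a hypothetical helper, `build_inputs`, which is not part of the original post. It assumes that keyword IDs are `hash()` values, as the header comment states, and that the position of a line in the list can serve as its unique reference number.

```python
def build_inputs(keywords, lines):
    """Build the four input structures described in the header comment.

    Hypothetical helper: the name and the use of enumerate() for the
    line IDs are assumptions made for illustration only.
    """
    keywordIDsList = []
    keywordDict = {}
    for keyWord in keywords:
        keywordID = hash(keyWord)  # the header comment says IDs are hash values
        keywordIDsList.append(keywordID)
        keywordDict[keywordID] = (keywordID, keyWord)

    valueIDsList = []
    valuesDict = {}
    for uniqueValueKey, valueText in enumerate(lines):
        valueIDsList.append(uniqueValueKey)
        valuesDict[uniqueValueKey] = (uniqueValueKey, valueText)

    return keywordIDsList, keywordDict, valueIDsList, valuesDict
```

Each dictionary value is a tuple whose first element repeats the key, which is the order the function relies on when it unpacks `keywordDict[keyID]` and `valuesDict[valueID]`.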
    

1 answer:

Answer 0: (score: 1)

This is a perfect use case for the Aho-Corasick string matching algorithm. A similar use case is explained, with Python code examples, in this blog post.
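
For illustration, here is a minimal pure-Python sketch of the idea. It is not the code from the blog post, and the function names (`build_automaton`, `count_keywords`, `parse_lines`) are hypothetical. It assumes the keywords are literal strings rather than regular expressions, and it counts overlapping occurrences, whereas `re.findall` counts non-overlapping ones.

```python
from collections import defaultdict, deque


def build_automaton(keywords):
    """Build an Aho-Corasick automaton stored as three parallel lists."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> keywords that end at this state

    # Phase 1: insert every distinct, non-empty keyword into a trie.
    for word in set(keywords):
        if not word:
            continue
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)

    # Phase 2: breadth-first walk to compute the failure links.
    queue = deque(goto[0].values())  # depth-1 states fall back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the fallback state (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def count_keywords(automaton, text):
    """Scan text once and count how often each keyword occurs in it."""
    goto, fail, out = automaton
    counts = defaultdict(int)
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            counts[word] += 1
    return dict(counts)


def parse_lines(keywords, lines):
    """Return {(line index, keyword): frequency} for every match."""
    automaton = build_automaton(keywords)
    results = {}
    for line_id, text in enumerate(lines):
        for word, freq in count_keywords(automaton, text).items():
            results[(line_id, word)] = freq
    return results
```

The automaton is built once from all keywords, and each text line is then scanned a single time, so the work per line no longer depends on looping over the 4,500 keywords one by one. The keys of the result give the keyword and the line, and the value gives the frequency, which covers the three outputs asked for in the question.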