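Best way to match a large list of strings in Python

Time: 2011-05-24 00:33:48

Tags: python list pattern-matching

I have a Python list containing about 700 terms that I want to use as metadata for some database entries in Django. I want to match the terms in the list against the entry descriptions to see whether any of them occur, but I have run into a few problems. My first problem is that the list contains multi-word terms that include words from other list entries. An example: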

Intrusion
Intrusion Detection

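I haven't gotten very far with re.findall, because it matches both Intrusion and Intrusion Detection in the example above. I only want to match Intrusion Detection, not Intrusion.

Is there a better way to do this kind of matching? I thought about trying NLTK, but it doesn't look like it would help with this type of matching.

Edit:

To add a little clarity: I have a list of 700 terms such as Firewall or Intrusion Detection. I want to match the words in that list against descriptions I have stored in a database to see whether any of them match, and I will use the matching terms in the metadata. So, if I have the following string: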

There are many types of intrusion detection devices in production today. 

and I have a list containing the following terms:

Intrusion
Intrusion Detection

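I want to match 'Intrusion Detection', but not 'Intrusion'. Ideally I would also like to match singular and plural forms, but I may be getting ahead of myself. The idea behind all of this is to take all the matches, put them in a list, and then process them.

2 Answers:

Answer 0: (score: 2)

If you need more flexibility in matching the entry descriptions, you can combine nltk and re: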

from nltk.stem import PorterStemmer
import re

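Suppose you have different descriptions of the same event, e.g. a rewrite of the system. You can use nltk.stem to capture rewrite, rewriting, rewrites, singular and plural forms, and so on.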

master_list = [
    'There are many types of intrusion detection devices in production today.',
    'The CTO approved a rewrite of the system',
    'The CTO is about to approve a complete rewrite of the system',
    'The CTO approved a rewriting',
    'Breaching of Firewalls'
]

terms = [
    'Intrusion Detection',
    'Approved rewrite',
    'Firewall'
]

stemmer = PorterStemmer()

# for each term, split it into words (could be just one word) and stem each word
stemmed_terms = ((stemmer.stem(word) for word in s.split()) for s in terms)

# add 'match anything after it' expression to each of the stemmed words
# join result into a pattern string
regex_patterns = [''.join(stem + '.*' for stem in term) for term in stemmed_terms]
print(regex_patterns)
print('')

for sentence in master_list:
    match_obs = (re.search(pattern, sentence, flags=re.IGNORECASE) for pattern in regex_patterns)
    matches = [m.group(0) for m in match_obs if m]
    print(matches)

Output:

['Intrus.*Detect.*', 'Approv.*rewrit.*', 'Firewal.*']

['intrusion detection devices in production today.']
['approved a rewrite of the system']
['approve a complete rewrite of the system']
['approved a rewriting']
['Firewalls']
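
As the output shows, the trailing `.*` makes each match run to the end of the sentence, and a stem can also match in the middle of a word. The following is a minimal sketch, not part of the original answer, of a tighter pattern. It assumes the stems are the ones printed above (hard-coded here in lower case so the snippet runs without nltk); `tight_pattern` and `first_matches` are hypothetical helper names.

```python
import re

# Stems copied from the printed patterns above (an assumption, so nltk is not needed here).
stemmed_terms = [["intrus", "detect"], ["approv", "rewrit"], ["firewal"]]

def tight_pattern(stems):
    """Build a pattern whose match stops at the end of the last stemmed word."""
    # \b anchors each stem at a word start, \w* absorbs the suffix,
    # and the lazy .*? allows other words between two stems.
    body = r"\w*\b.*?\b".join(re.escape(s) for s in stems)
    return re.compile(r"\b" + body + r"\w*", re.IGNORECASE)

patterns = [tight_pattern(stems) for stems in stemmed_terms]

def first_matches(sentence):
    """Return the matched text for every pattern that hits the sentence."""
    return [m.group(0) for m in (p.search(sentence) for p in patterns) if m]

print(first_matches("There are many types of intrusion detection devices in production today."))
print(first_matches("The CTO is about to approve a complete rewrite of the system"))
print(first_matches("Breaching of Firewalls"))
```

With these patterns the matched text is only the span from the first stemmed word to the last one, which may be easier to post-process than a match that swallows the rest of the sentence.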

Edit

To see which of the terms caused the match:

for sentence in master_list:
    # regex_patterns maps directly onto terms (strictly speaking it's one-to-one and onto)
    for term, pattern in zip(terms, regex_patterns):
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            # process term (put it in the db)
            print('TERM: {0} FOUND IN: {1}'.format(term, sentence))

Output:

TERM: Intrusion Detection FOUND IN: There are many types of intrusion detection devices in production today.
TERM: Approved rewrite FOUND IN: The CTO approved a rewrite of the system
TERM: Approved rewrite FOUND IN: The CTO is about to approve a complete rewrite of the system
TERM: Approved rewrite FOUND IN: The CTO approved a rewriting
TERM: Firewall FOUND IN: Breaching of Firewalls
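
Note that if the term list contained both 'Intrusion' and 'Intrusion Detection', the loop above would report both for the first sentence. A minimal sketch, not part of the original answer, of one way to report only the longest term: put all terms into a single alternation sorted longest first, since the regex engine takes the first alternative that matches at a position and does not rescan the text it has consumed. This variant uses plain `re` without stemming; the `(?:s|es)?` suffix is a crude plural allowance, and `build_term_matcher` and `find_terms` are hypothetical names.

```python
import re

def build_term_matcher(terms):
    """Compile one regex that prefers the longest term at each position."""
    # Longest first: alternation takes the first alternative that matches.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    # \b keeps matches on word boundaries; (?:s|es)? tolerates a simple plural.
    return re.compile(r"\b(" + alternation + r")(?:s|es)?\b", re.IGNORECASE)

def find_terms(text, terms):
    """Return the canonical terms found in text, in order of appearance."""
    canonical = {t.lower(): t for t in terms}
    matcher = build_term_matcher(terms)
    return [canonical[m.group(1).lower()] for m in matcher.finditer(text)]

terms = ["Intrusion", "Intrusion Detection", "Firewall"]
text = "There are many types of intrusion detection devices in production today."
print(find_terms(text, terms))
```

Because the text consumed by 'Intrusion Detection' is not scanned again, the shorter 'Intrusion' is reported only where it appears on its own.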

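Answer 1: (score: 0)

The question isn't clear, but as I understand it you have a master list of terms, say one term per line. Next, you have a list of test data, where some of the test data will be in the master list and some will not. You want to see whether each piece of test data is in the master list and, if it is, perform a task.

Suppose your master list looks like this


Intrusion Detection
Firewall
FooBar

and your test data looks like this


Intrusion
Intrusion Detection
foo
bar

This simple script should point you in the right direction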

#!/usr/bin/env python3

import sys


def main():
    '''usage: tester.py masterList testList'''
    # open both files; the with-block closes them when done
    with open(sys.argv[1]) as masterListFile, open(sys.argv[2]) as testListFile:
        # build master list
        # .strip() off the '\n' newline
        # set to lower case: Intrusion != intrusion, but they should match
        masterList = [line.strip().lower() for line in masterListFile]

        # run test
        for line in testListFile:
            term = line.strip().lower()
            if term in masterList:
                print(term, "in master list!")
                # perhaps grab your metadata using a LIKE %% query
            else:
                print("OH NO!", term, "not found!")


if __name__ == '__main__':
    main()

SAMPLE OUTPUT


OH NO! intrusion not found!
intrusion detection in master list!
OH NO! foo not found!
OH NO! bar not found!

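There are several other ways to do this, but this should point you in the right direction. If your list is large (700 really isn't that big), consider using a dict: a membership test on a dict is a hash lookup rather than a scan of the whole list, so it is faster, especially if you plan to query a database for each term. The dictionary structure could look like {term: information about term}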
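
A minimal sketch of that dictionary idea, under stated assumptions: the metadata values and the names `term_info` and `lookup` are made up for illustration, and the keys are lower-cased so that lookups are case-insensitive as in the script above.

```python
# Hypothetical metadata; the keys are lower-cased terms from the master list.
term_info = {
    "intrusion detection": {"category": "security"},
    "firewall": {"category": "network"},
}

def lookup(term, table=term_info):
    """Return the metadata for a term, or None if the term is unknown."""
    # dict.get is a hash lookup, so it does not scan every entry.
    return table.get(term.strip().lower())

print(lookup("Intrusion Detection"))
print(lookup("foo"))
```

If only membership matters and there is no per-term information to store, a set of lower-cased terms would serve the same purpose.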