Question

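I have a lexicon containing thousands of strings (single words, compound words, hyphenated compound words, and strings) and a dataset of text documents. I want to count how many lexicon elements occur as exact matches in each text document.

This is what I tried: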

lexicon = ['A', 'FOO', 'f']
instance = 'fA near A AFOO FO ff'

matches = []
for word in lexicon:
    if word in instance:
        matches.append(word)

The expected result is ['A'], but the code above also matches substrings and returns ['A', 'FOO', 'f'].

A second approach, using a regular expression:

import re

matches = []
for word in lexicon:
    if re.search(r'\b' + word + r'\b', instance):
    #if re.search(r'\b({})\b'.format(word), instance):
        matches.append(word)

The list obtained this way is exactly what I need, but on my real lexicon it fails with the following error:

File "<ipython-input-18-5331958cdf85>", line 4, in <module>
    if re.search(r'\b' + word + r'\b', instance):

  File "/opt/anaconda3/lib/python3.7/re.py", line 183, in search
    return _compile(pattern, flags).search(string)

  File "/opt/anaconda3/lib/python3.7/re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)

  File "/opt/anaconda3/lib/python3.7/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)

  File "/opt/anaconda3/lib/python3.7/sre_parse.py", line 938, in parse
    raise source.error("unbalanced parenthesis")

error: unbalanced parenthesis

I don't know how to fix this error, or how to approach the problem differently.

Any help would be much appreciated!

Answer 1

I think what you are looking for is the lexicon words that appear as whole tokens in the document. If so, this should work:

lexicon = ['A', 'FOO', 'f']
instance = 'fA near A AFOO FO ff'

tokens = set(instance.split())
matches = []

for word in lexicon:
    if word in tokens:
        matches.append(word)

# matches should equal ['A'] in this example
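If a count per document is needed rather than a list, the same token idea can be extended with collections.Counter. This is only a minimal sketch, assuming whitespace tokenization is good enough for the data; the function name count_lexicon_tokens is hypothetical. Note that splitting on whitespace can never match a lexicon entry that itself contains a space.

```python
from collections import Counter

def count_lexicon_tokens(lexicon, document):
    """Count how often each lexicon entry appears as a whitespace-delimited token."""
    token_counts = Counter(document.split())
    # Keep only the entries that occur at least once in the document.
    return {word: token_counts[word] for word in lexicon if token_counts[word] > 0}

lexicon = ['A', 'FOO', 'f']
instance = 'fA near A AFOO FO ff'

counts = count_lexicon_tokens(lexicon, instance)
print(counts)                # {'A': 1}
print(sum(counts.values()))  # 1
```

Counting the tokens once and then looking up each lexicon entry avoids rescanning the document for every word.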

Answer 2

The problem with the regex version is that some words in the lexicon list may contain special regex characters such as ( or [.
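The failure can be reproduced in isolation. The sketch below assumes a lexicon entry such as 'FOO)' (a made-up example, not taken from the real lexicon) and shows that the raw string is rejected as a pattern, while re.escape turns the parenthesis into a literal character.

```python
import re

word = 'FOO)'  # hypothetical lexicon entry with an unmatched parenthesis

try:
    re.compile(r'\b' + word + r'\b')
    compiled = True
except re.error as exc:
    # The unescaped ')' is parsed as regex syntax, not as text.
    compiled = False
    print('pattern rejected:', exc)

escaped = re.escape(word)
print(escaped)  # FOO\)
```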

Escape the words in the lexicon and it should work:

import re

lexicon = ['A', 'FOO(()))', 'f']
instance = 'fA near A AFOO FO ff'

matches = []
for word in lexicon:
    if re.search(r'\b' + re.escape(word) + r'\b', instance):
        matches.append(word)

print(matches)

Prints:

['A']
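Two caveats may matter for a lexicon with thousands of entries. First, \b only matches between a word character and a non-word character, so an escaped entry that ends in punctuation, such as 'FOO(()))', is not found when it is followed by a space. Second, running one search per entry per document can be slow. The following is a rough sketch of one possible alternative, not a definitive implementation: it builds a single alternation and uses the lookarounds (?<!\w) and (?!\w) instead of \b. The names build_lexicon_pattern and count_matches, and the extra entry 'well-known', are made up for illustration.

```python
import re
from collections import Counter

def build_lexicon_pattern(lexicon):
    """Compile one pattern that matches any lexicon entry as a whole unit."""
    # Longest entries first, so a longer entry is tried before its prefix.
    escaped = sorted((re.escape(w) for w in lexicon), key=len, reverse=True)
    # Lookarounds instead of \b: the entry must not touch a word character
    # on either side, which also works for entries ending in punctuation.
    return re.compile(r'(?<!\w)(?:' + '|'.join(escaped) + r')(?!\w)')

def count_matches(pattern, document):
    """Return a Counter of the lexicon entries found in the document."""
    return Counter(m.group(0) for m in pattern.finditer(document))

lexicon = ['A', 'FOO(()))', 'f', 'well-known']
pattern = build_lexicon_pattern(lexicon)

print(count_matches(pattern, 'fA near A AFOO FO ff'))
print(count_matches(pattern, 'a well-known FOO(())) case'))
```

Matches found by finditer do not overlap, so an entry that sits inside a longer matched entry is not counted separately; whether that is the desired behavior depends on the lexicon.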

Python exact match - exact matching of lexicon elements in a string
