Question

I am trying to search a given text for a specified list of words. The code is very simple.


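The problem is that when I enter the following text: "I picked up my children in the car and they ate some pears." it does not match the words "pick up" or "eat". I guess this is because in the text they appear in the past tense, while in the word list they are in the infinitive. So the search only finds exact matches and does not take inflection into account (verb forms, irregular verbs, etc.).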
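The original snippet is not reproduced here. As a hypothetical reconstruction of the kind of exact-match search described (the names find_words and word_list are illustrative and not taken from the original post), the behaviour can be reproduced roughly like this:

```python
import re


def find_words(word_list, text):
    """Return the entries of word_list that occur verbatim in text."""
    found = []
    for word in word_list:
        # \b anchors make this a whole-word match on the exact form only
        if re.search(r"\b" + re.escape(word) + r"\b", text, flags=re.IGNORECASE):
            found.append(word)
    return found


text = "I picked up my children in the car and they ate some pears."
print(find_words(["eat", "car", "house", "pick up", "child"], text))
# prints ['car']
```

Only "car" is found, because it is the only entry that appears in the text in exactly the listed form.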

Is there a way to search a text for the words in a word list without regard to inflection? Thanks!

Answer 1

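This is a natural language processing task, that is, a problem closely tied to the natural language being processed. It is not purely an algorithmic problem, because the algorithm first has to "understand" or "represent" how inflection works in the language you are using.

Such solutions rely on statistical models, which means we will not get 100% accuracy. Natural language is simply too complex for a deterministic algorithm to solve this problem with 100% accuracy.

For English, there is a Python package, LemmInflect, which claims an accuracy of 96.1% for English verbs.

With it, we can do the following: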

import lemminflect


def find_lemmas(word_set: set, test_string: str) -> list:
    word_set = set(word_set)
    found_lemmas = []

    for word in test_string.split(" "):
        lemma_dict = lemminflect.getAllLemmas(word)
        if lemma_dict:
            # values of getAllLemmas are tuples, we need a flat set
            lemmas = {y for x in lemma_dict.values() for y in x}
            found_lemma = list(lemmas & word_set)
            if found_lemma:
                found_lemmas.append(found_lemma[0])

    return found_lemmas

This gives us

>>> word_set = {"eat", "car", "house", "pick up", "child"}
>>> test_string = """After he ate the cake he left the 
    house and went to his car. Then he wondered whether picking up the 
    children now would really be the best idea."""
>>> find_lemmas(word_set=word_set, test_string=test_string)
['eat', 'house', 'child']
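Note that "car" is missing from this output even though the text contains it. The most likely reason is that splitting on single spaces leaves the punctuation attached ("car."), so the lookup for that token fails. A minimal sketch of a tokenizer that could be used in place of test_string.split(" "), assuming plain English text (the helper name tokenize is made up for this illustration):

```python
import re


def tokenize(text: str) -> list:
    """Split text into lowercase word tokens, dropping punctuation and whitespace."""
    # A token is a run of letters, optionally joined by apostrophes or hyphens.
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())


print(tokenize("He left the \n    house and went to his car."))
# prints ['he', 'left', 'the', 'house', 'and', 'went', 'to', 'his', 'car']
```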

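We can see that "pick up" was not recognized. This is because we parse test_string word by word, which destroys the structure of any multi-word expression. Getting the lemmas of such compounds therefore requires more complex logic.

We could split the items in word_set into their component words and check for each component separately. We would then still need logic to decide whether the occurrence of inflected forms of both components really is an occurrence of an inflected form of the compound, that is, we would need to rule out cases like the following:

"She bent down to pick a penny. Then she looked up and realised she had lost a pound."

In this case we would find a form of "pick" and a form of "up", but not a form of "pick up".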
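One simple way to rule out such cases is to require the components to be directly adjacent in the sequence of lemmas. The following is only a sketch under that assumption: the function name find_phrases is hypothetical, and the lemma lists are written by hand to stand in for real lemmatizer output.

```python
def find_phrases(phrases, lemmas):
    """Return the phrases whose words occur as a contiguous run in lemmas."""
    found = []
    for phrase in phrases:
        parts = phrase.split()
        n = len(parts)
        # Slide a window of the phrase's length over the lemma sequence.
        if any(lemmas[i:i + n] == parts for i in range(len(lemmas) - n + 1)):
            found.append(phrase)
    return found


# Hand-written lemma sequences standing in for real lemmatizer output.
adjacent = ["then", "he", "wonder", "whether", "pick", "up", "the", "child"]
apart = ["she", "bend", "down", "to", "pick", "a", "penny", "then", "she", "look", "up"]

print(find_phrases(["pick up", "child"], adjacent))  # prints ['pick up', 'child']
print(find_phrases(["pick up"], apart))              # prints []
```

The adjacency requirement is deliberately strict: it also rejects genuine uses where the parts are separated, as in "pick her up".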

Answer 2

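As jonathan.scholbach said, what you want to do is lemmatize the words in your text. The lemma of a word is the form of the word you would find in a dictionary.

There is a simple way to do this with spaCy; it looks like this: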

import spacy

nlp=spacy.load('en_core_web_sm')
sent = "  I picked up my children in the car and they ate some pears.."
word_list = ["eat", "car", "house", "pick up", "child"]
doc = nlp(sent)
doc_lemma = " "
for token in doc:
    # For words without a defined lemma, such as pronouns, spaCy 2.x returns -PRON-.
    # Catch those cases and use the lowercased form from the text instead:
    if token.lemma_ == "-PRON-":
        doc_lemma = doc_lemma + token.text.lower() + " "
    else:
        # Put the lemmas in a string, so words like "pick up" will be found as well
        doc_lemma = doc_lemma + token.lemma_ + " "

# doc_lemma now looks like this:
# i pick up my child in the car and they eat some pear ..
for word in word_list:
    if word in doc_lemma:
        print(word)
#output:
#    eat
#    car
#    pick up
#    child
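One caveat with the `word in doc_lemma` test is that it is a plain substring check, so "car" would also be reported for a text whose lemmas contain "scar" or "care". A small sketch that restricts matches to whole words, assuming the lemma string is space-separated as above (the helper contains_phrase is hypothetical):

```python
import re


def contains_phrase(lemma_string: str, phrase: str) -> bool:
    """True if phrase occurs in lemma_string as whole words, not as a substring."""
    # The lookarounds require a non-word character (or string edge) on both sides.
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return re.search(pattern, lemma_string) is not None


doc_lemma = " i pick up my child in the car and they eat some pear .. "
print(contains_phrase(doc_lemma, "pick up"))  # prints True
print(contains_phrase(doc_lemma, "ear"))      # prints False, although "ear" in doc_lemma is True
```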

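Edit: As noted in the comments, this solution only matches compounds whose parts are directly adjacent: pick up is matched in I picked up the apple, but not in Did you pick her up?

A workaround for verb + particle combinations such as pick up could look like this: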

# Find the root (the verb) and a corresponding particle
root = None
particle = None
for token in doc:
    if token.dep_ == "ROOT":
        root = token.lemma_
    if token.dep_ == "prt":
        particle = token.lemma_
# If both a particle and a root exist in the sentence, append them together to our final string,
# so verb + particle like "pick up" is matched even when the two are not next to each other.
if root is not None and particle is not None:
    doc_lemma = doc_lemma + root + " " + particle

This workaround may have other shortcomings, for example when subordinate clauses are involved.
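The subordinate-clause problem arises because the particle may belong to a verb other than the sentence root. One possible refinement is to pair each particle with its own syntactic head instead of the root. The sketch below illustrates the idea on a hand-annotated parse; Tok is a stand-in for spaCy tokens, and phrasal_verbs is a hypothetical helper. With real spaCy tokens the same pairing would use token.head.lemma_ and token.lemma_ for tokens whose token.dep_ is "prt".

```python
from collections import namedtuple

# Stand-in for the token attributes used here: lemma, dependency label,
# and the index of the head token within the sentence.
Tok = namedtuple("Tok", ["lemma", "dep", "head"])


def phrasal_verbs(tokens):
    """Return 'verb particle' strings, pairing each particle with its own head."""
    return [tokens[t.head].lemma + " " + t.lemma for t in tokens if t.dep == "prt"]


# Hand-annotated parse of "She said that you picked her up" (illustrative, not parser output).
tokens = [
    Tok("she", "nsubj", 1),
    Tok("say", "ROOT", 1),
    Tok("that", "mark", 4),
    Tok("you", "nsubj", 4),
    Tok("pick", "ccomp", 4),
    Tok("her", "dobj", 4),
    Tok("up", "prt", 4),
]
print(phrasal_verbs(tokens))  # prints ['pick up']
```

Combining the root with the particle would give "say up" for this sentence, whereas following the particle's head gives "pick up".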

Searching a text for words regardless of inflection: Python
