Question

I am trying to match n-grams (multi-word phrases) from a list against a text string.

My sample matching list contains items such as:

matching_list = ['Data Scientist',
 'Associate Research Scientist',
 'Post Doctoral Research Fellow',
 'Research Scientist',
 'Assistant Professor', 
 'c# developer', 
 '.net engineer']

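My sample text, after parsing, looks like this:

text = 'I am a Corporate Account Manager with experience as Data Scientist, Associate Research Scientist, Post Doctoral Research Fellow, .Net engineer c# developer Assistant Professor'

I converted both the matching list and the text to lowercase and then searched using the following code.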

import re

# Uncomment when Matching 4-gram words
#findnames = re.compile(r'([A-Z]\w*(?:\s[A-Z]\w*(?:\s[A-Z]\w*(?:\s[A-Z]\w*)?)?)?)')

# Uncomment when Matching tri-gram words
#findnames = re.compile(r'([A-Z]\w*(?:\s[A-Z]\w*(?:\s[A-Z]\w*)?)?)')

# Uncomment when Matching bi-gram words
findnames = re.compile(r'([A-Z]\w*(?:\s[A-Z]\w*)?)')

def is_name_in_text(text, matching_list):
        for possible_name in set(findnames.findall(text)):
            if possible_name in matching_list:
                print(possible_name)
        return possible_name

is_name_in_text(text, matching_list)

For bigram matching, I expected to get

    Research Scientist
    Data Scientist
    Assistant Professor
    c# developer
    .net engineer

However, I get the following output

     Data Scientist
     Assistant Professor

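1) I cannot match special characters.

2) A bigram is 2 words, a trigram is 3 words, and so on. Instead of the pattern sliding over the sentence one word at a time, it seems that when no match is found the bigram jumps 2 words at a time and the trigram jumps 3 words at a time. This causes problems depending on whether the bigram I am looking for starts at an odd or an even word position.

My list contains 7 special characters: #, @, +, ., _, - and *

I need to fix both the special characters and the word-by-word pattern matching over the corpus. I could not come up with a suitable expression along the lines of re.compile(r'([A-Z]\w*(?:\s[A-Z]\w*)?)').

I am also unsure about the expressions for trigrams and 4-grams.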

Answer 1

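You are trying to match word-level n-grams, specifically word-level bigrams.

However, the regex you provided, ([A-Z]\w*(?:\s[A-Z]\w*)?), matches a run of word characters that starts with a character in the range A to Z, optionally followed by a whitespace character and another such run.

That regex will never match c# developer, because it does not start with A to Z and it contains #. It will not match .net engineer either, because that starts with a dot. Also note that you want to match .net engineer, but the text contains .Net engineer.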
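A quick check makes this visible. The sketch below runs the bigram pattern from the question over the tail of the sample text; the variable names sample and found are only for illustration.

```python
import re

# The bigram pattern from the question, unchanged.
findnames = re.compile(r'([A-Z]\w*(?:\s[A-Z]\w*)?)')

sample = 'Assistant Professor .Net engineer c# developer'
found = findnames.findall(sample)

# '.Net' only matches from the 'N' onwards, and 'c# developer' has no
# capital letter to start from, so it is skipped entirely.
print(found)  # ['Assistant Professor', 'Net']
```

Neither of the two phrases with special characters survives, which is why they never show up in the output.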

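Furthermore, when you use that regex with findall, it consumes the string in pairs of capitalized words, which prevents any word from being reused. After matching Corporate Account it can never match Account Manager, because the Account part has already been consumed. You are using a non-capturing group, but the regex still consumes that part of the string.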
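The consumption is easy to reproduce on the first part of the sample text. The second half of this sketch is an aside, not part of the approach recommended here: wrapping a pattern in a lookahead makes each match zero-width, so pairs can overlap. It still does nothing for case or special characters, and the simplified pattern in it (both words required, with a leading \b) is my own assumption for the demonstration.

```python
import re

findnames = re.compile(r'([A-Z]\w*(?:\s[A-Z]\w*)?)')
sample = 'I am a Corporate Account Manager with experience as Data Scientist'

# findall returns non-overlapping matches, scanning left to right.
consumed = findnames.findall(sample)
print(consumed)  # ['I', 'Corporate Account', 'Manager', 'Data Scientist']

# A lookahead consumes nothing, so every capitalized word can start a pair.
overlapping = re.findall(r'(?=\b([A-Z]\w*\s[A-Z]\w*))', sample)
print(overlapping)  # ['Corporate Account', 'Account Manager', 'Data Scientist']
```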

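Assuming you want to match word-level n-grams case-insensitively, and that you need to match special characters such as #, I do not think a single regex will do what you need, but some basic Python code will.

Consider that generating candidates with a regex and then filtering out every n-gram that does not consist entirely of word characters or your chosen special characters is not very efficient. Why not simply split the string on whitespace and look up the n-grams you are after?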

import re

text = 'I am a Corporate Account Manager with experience as Data Scientist' \
       ' Associate Research Scientist Post Doctoral Research Fellow Research' \
       ' Scientist Assistant Professor .Net engineer c# developer'

matching_list = [
    'Data Scientist',
    'Associate Research Scientist',
    'Post Doctoral Research Fellow',
    'Research Scientist',
    'Assistant Professor',
    'c# developer',
    '.net engineer'
]


def get_ngrams(words, n):
    # zip n staggered copies of the word list; zip stops at the shortest one
    return zip(*[words[m:] for m in range(n)])


def main():
    # simply split up the text, you could also just go words = text.split()
    regex = re.compile(r'[^\s]+')
    words = regex.findall(text.lower())
    # turn the list of words into ngrams of the needed length
    ngrams = list(get_ngrams(words, 2))
    # also create ngrams for the phrases in matching_list 
    # then link them to the phrases in a dict for easy reference
    matching_ngrams = {
        k: v for k, v in zip(
            [tuple(x.lower().split()) for x in matching_list], matching_list 
        )
    }

    # find all the matching ones and print the matching phrase when found
    for find_this in ngrams:
        if find_this in matching_ngrams:
            print(matching_ngrams[find_this])


main()

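Note that this still produces duplicates, and you indicated that you want each result only once. You can achieve that by flipping the loop and the comparison: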

    for find_this in matching_ngrams:
        if find_this in ngrams:
            print(matching_ngrams[find_this])

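This takes more time, because each `in ngrams` test scans the whole list of n-grams, but each phrase is printed only once if it occurs in the text. Alternatively, you could write a function that returns all the matches and put them into a set.

To avoid building that list, the inefficient lookup and the unnecessary re, I would prefer this: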

def get_ngrams(words, n):
    # zip n staggered copies of the word list; zip stops at the shortest one
    return zip(*[words[m:] for m in range(n)])


def find_matching_ngrams(text, phrases, n):
    ngrams_phrases = {
        k: v for k, v in zip(
            [tuple(x.lower().split()) for x in phrases], phrases
        )
    }

    for ngram in get_ngrams(text.lower().split(), n):
        if ngram in ngrams_phrases :
            yield ngrams_phrases[ngram]


def main():
    text = 'I am a Corporate Account Manager with experience as Data Scientist' \
           ' Associate Research Scientist Post Doctoral Research Fellow Research' \
           ' Scientist Assistant Professor .Net engineer c# developer'

    matching_list = [
        'Data Scientist',
        'Associate Research Scientist',
        'Post Doctoral Research Fellow',
        'Research Scientist',
        'Assistant Professor',
        'c# developer',
        '.net engineer'
    ]

    print(set(find_matching_ngrams(text, matching_list, 2)))


main()

Possibly more efficient:

def get_ngrams(words, n):
    for m in range(len(words)-(n-1)):
        yield tuple(words[m:m+n])
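Since the question also asks about trigrams and 4-grams, here is a minimal sketch of how the same idea might cover phrases of mixed length in one pass. The helper name find_phrases is hypothetical, and the sketch assumes the text is whitespace-separated with no punctuation attached to the words.

```python
def get_ngrams(words, n):
    # Slide a window of n words over the list, one word at a time.
    for m in range(len(words) - (n - 1)):
        yield tuple(words[m:m + n])


def find_phrases(text, phrases):
    # Hypothetical helper: key each phrase by its lower-cased word tuple.
    lookup = {tuple(p.lower().split()): p for p in phrases}
    words = text.lower().split()
    found = set()
    # Only build n-grams for the lengths that actually occur in phrases.
    for n in {len(key) for key in lookup}:
        for ngram in get_ngrams(words, n):
            if ngram in lookup:
                found.add(lookup[ngram])
    return found


text = ('I am a Corporate Account Manager with experience as Data Scientist'
        ' Associate Research Scientist Post Doctoral Research Fellow Research'
        ' Scientist Assistant Professor .Net engineer c# developer')

matching_list = [
    'Data Scientist',
    'Associate Research Scientist',
    'Post Doctoral Research Fellow',
    'Research Scientist',
    'Assistant Professor',
    'c# developer',
    '.net engineer',
]

result = find_phrases(text, matching_list)
print(sorted(result))
```

On this sample all seven phrases are found. If the text contains punctuation, words such as "scientist," keep the trailing comma after split(), so you would probably want to strip sentence punctuation first while leaving characters like # and the leading dot alone.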
