Question

I have a Python program as follows.

output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        index = sentence.find(item)
        if index >= 0:
            sentence_tokens.append((index, item))
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    output.append(sentence_tokens)

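Given sentences and selected_concepts, I need to extract the word-level longest concept matches from selected_concepts, in the order in which they appear in each sentence. Both sentences and selected_concepts are preprocessed, so they contain no punctuation, etc.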

For example:

sentences = ["i love data mining and machine learning", "python is the best programming language", "the learning process of data mining and python are awesome"]

selected_concepts = ["learning", "python", "programming language", "d", "dat", "data", "data mining", "a m", "machine learning", "l"]

Current output:

[['l', 'd', 'dat', 'data', 'data mining', 'a m', 'machine learning', 'learning'], ['python', 'programming language', 'l'], ['learning', 'l', 'd', 'dat', 'data', 'data mining', 'a m', 'python']]

I want the output to be:

[["data mining", "machine learning"], ["python", "programming language"], ["learning", "data mining", "python"]]

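The problem with my current program is that it does not distinguish between overlapping concepts such as d, dat, data and data mining, so it returns all of them instead of only the longest match.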

I am not interested in using regex patterns, since that would slow the process down.

Please let me know if further details are needed.

Answer 1

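Regex works here. Since the re module does not support overlapping matches, first sort the list of concepts by decreasing length, then turn it into a regex. An alternation tries its alternatives from left to right, so when you use re.findall, the longest concept that fits at a given position is always matched first.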

import re

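# Sort the longest concepts first so the alternation prefers them
r = sorted(selected_concepts, key=len, reverse=True)
# Wrap each concept in word boundaries so only whole words match
rgx = '|'.join([fr'\b{word}\b' for word in r])

[re.findall(rgx, sentence) for sentence in sentences]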

[['data mining', 'machine learning'],
 ['python', 'programming language'],
 ['learning', 'data mining', 'python']]
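
To see why the sort order matters, here is a small sketch that compares an unsorted alternation with a longest-first one. The helper name build_pattern is made up for illustration, and re.escape is an added precaution that the answer above does not use (the question states that the inputs contain no punctuation).

```python
import re

def build_pattern(concepts, longest_first=True):
    """Compile one alternation of whole-word concepts (hypothetical helper)."""
    if longest_first:
        concepts = sorted(concepts, key=len, reverse=True)
    # re.escape is an extra safeguard, not part of the answer above
    return re.compile('|'.join(rf'\b{re.escape(c)}\b' for c in concepts))

concepts = ["data", "data mining"]
sentence = "i love data mining"

# Without sorting, "data" is tried first and wins at that position
unsorted_hits = build_pattern(concepts, longest_first=False).findall(sentence)
# With the longest concept first, "data mining" is matched as a whole
sorted_hits = build_pattern(concepts).findall(sentence)

print(unsorted_hits)  # ['data']
print(sorted_hits)    # ['data mining']
```

Compiling the pattern once and reusing it for every sentence also avoids rebuilding the alternation inside the loop.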

Answer 2

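If I understand your question correctly, you do not want to include "concepts" that are already part of a longer "concept" that was included?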

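Regex can actually be very efficient and may well be faster than the solution you wrote. However, the solution as shared can be fixed by simply adding the following line: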

output = [[w1 for w1 in l if not any([w2 != w1 and w2.find(w1) >= 0 for w2 in l])] for l in output]

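That is not very efficient, though, since it still finds all matches first and then runs a fairly expensive operation to remove every match that is contained in a longer result.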
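
As a rough end-to-end check, the sketch below combines the question's substring search with that filter line and runs both on the question's sample data. The function names find_substrings and drop_contained are illustrative only; they just wrap the code already shown.

```python
sentences = ["i love data mining and machine learning",
             "python is the best programming language",
             "the learning process of data mining and python are awesome"]
selected_concepts = ["learning", "python", "programming language", "d", "dat",
                     "data", "data mining", "a m", "machine learning", "l"]

def find_substrings(sentence, concepts):
    """The question's approach: plain substring search, ordered by position."""
    hits = [(sentence.find(c), c) for c in concepts if c in sentence]
    return [c for _, c in sorted(hits, key=lambda x: x[0])]

def drop_contained(found):
    """The filter line: drop any concept that is a substring of another found concept."""
    return [w1 for w1 in found
            if not any(w2 != w1 and w1 in w2 for w2 in found)]

output = [drop_contained(find_substrings(s, selected_concepts)) for s in sentences]
print(output)
```

For this particular input the result equals the desired output from the question. The filter compares every found concept with every other one, so its cost grows quadratically with the number of matches per sentence.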

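Just sorting the list by length (with regex or otherwise) will not work, because a substring may be part of several longer strings and should still be found when it occurs outside of those longer strings. For example, suppose selected_concepts is something like ["big example", "small example", "example", "small", "big"]. Then running the sentence "this big example has a small solution for example" should still find ["big example", "small", "example"].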

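Your code has further problems, however, because it ignores your requirement to find whole-word concepts only. In your example, if "v" were added as a concept, it would be found in love, and it would not be eliminated as part of another concept. Also, the line I provided above eliminates concepts that occur both on their own and as part of a larger concept.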
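
Both pitfalls can be reproduced with a short sketch. The helper find_and_filter is a hypothetical name that simply chains the question's substring search and the one-line filter from this answer.

```python
def find_and_filter(sentence, concepts):
    """Substring search followed by the containment filter (illustrative helper)."""
    hits = [(sentence.find(c), c) for c in concepts if c in sentence]
    found = [c for _, c in sorted(hits, key=lambda x: x[0])]
    return [w1 for w1 in found
            if not any(w2 != w1 and w1 in w2 for w2 in found)]

# Pitfall 1: "v" is not a whole word, yet it is found inside "love" and survives
with_v = find_and_filter("i love data mining and machine learning",
                         ["data mining", "machine learning", "learning", "v"])
print(with_v)  # ['v', 'data mining', 'machine learning']

# Pitfall 2: the standalone "example" at the end of the sentence is lost,
# because "example" is also contained in the found concept "big example"
lost = find_and_filter("this big example has a small solution for example",
                       ["big example", "small example", "example", "small", "big"])
print(lost)  # ['big example', 'small']
```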

A better, more complete solution (still without regex):

sentences = ["i love data mining and machine learning", "python is the best programming language",
             "the learning process of data mining and python are awesome"]
selected_concepts = ["learning", "python", "programming language", "d", "dat", "data", "data mining", "a m",
                     "machine learning", "l"]

split_sentences = [s.split() for s in sentences]
split_selected_concepts = [s.split() for s in sorted(selected_concepts, key=len, reverse=True)]

sentence_concepts = []
for s in split_sentences:
    concepts = []
    for c in split_selected_concepts:
        new_s = []
        i = 0
        while i < len(s):
            # if concept is found
            if s[i:i + len(c)] == c:
                # save it and skip it, so it isn't found again
                concepts.append((i, c))
                # keep blanks in new_s to ensure correct index for further results
                new_s.extend(len(c) * [None])
                i += len(c)
            else:
                # if the current word doesn't start this concept, keep it
                new_s.append(s[i])
                i += 1
        s = new_s
    # reorder the found concepts and turn the lists back into strings
    sentence_concepts.append([' '.join(x[1]) for x in sorted(concepts, key=lambda x: x[0])])

print(sentence_concepts)
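
For reuse, the same loop can be wrapped in a function and tried on the two tricky cases mentioned earlier. This is only a sketch of the approach above: the name extract_concepts is invented here, and the check that skips empty concepts is an added assumption to avoid an endless loop on blank strings.

```python
def extract_concepts(sentence, concepts):
    """Word-level, longest-first concept matching (sketch of the loop above)."""
    words = sentence.split()
    # Longest concepts first, each as a list of words; blank concepts are skipped
    split_concepts = [c.split() for c in sorted(concepts, key=len, reverse=True)
                      if c.strip()]
    found = []
    for c in split_concepts:
        remaining = []
        i = 0
        while i < len(words):
            if words[i:i + len(c)] == c:
                found.append((i, c))
                # Blank out matched words so shorter concepts cannot reuse them
                remaining.extend([None] * len(c))
                i += len(c)
            else:
                remaining.append(words[i])
                i += 1
        words = remaining
    # Restore sentence order and join the word lists back into strings
    return [' '.join(c) for _, c in sorted(found, key=lambda x: x[0])]

big = extract_concepts("this big example has a small solution for example",
                       ["big example", "small example", "example", "small", "big"])
print(big)  # ['big example', 'small', 'example']

no_v = extract_concepts("i love data mining and machine learning",
                        ["learning", "data", "data mining", "machine learning", "v"])
print(no_v)  # ['data mining', 'machine learning']
```

Because matching happens on whole words, "v" is never found inside love, and the standalone "example" is kept even though "big example" was matched earlier in the sentence.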
