Question

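Given two phrases, I want to identify the part they have in common.

For example, given the two phrases "technology learning TEL" and "learning TEL approach", I want to identify the common term learning TEL.

Another example: for lightweight web applications and software web applications, the common term is web applications.

My current code uses in, as shown below.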

for item1 in mylist_1:
    for item2 in mylist_2:
        if item2 in item1:
            tmp_mylist1.append(item2)
            break

print(tmp_mylist1)

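However, it cannot identify implied word phrases like the ones in my examples above, because in only matches when one whole string is contained in the other.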

if "technology learning TEL" in "learning TEL approach":
    print("done")
else:
    print("no")

So, what is the fastest way to identify these implied common consecutive words in Python?

Answer 1

There are certainly faster ways, but since nobody has replied here yet, this is one solution:

import itertools

def best_combination(string1, string2):
    '''
    Gives best words combinations within both strings
    '''
    words = string1.split()
    # All possible solutions for a case
    solutions = []

    # Loop to increment number of words combination to test
    for i in range(1, len(words) + 1):
        # get all possible combinations according to current number of words to test
        possibilities = list(itertools.combinations(words, i))

        # test all possibilities
        for possibility in possibilities:
            tested_string = ' '.join(possibility)

            # If it matches, add it to the solutions list
            if tested_string in string2:
                solutions.append(tested_string)

    # Best solution is the longest; return '' when nothing matches
    solutions.sort(key=len, reverse=True)
    return solutions[0] if solutions else ''


print(best_combination('technology learning TEL', 'learning TEL approach'))
print(best_combination('aaa bbb ccc', 'bbb ccc'))
print(best_combination('aaa bbb ccc', 'aaa bbb ccc'))
print(best_combination('aaa bbb ccc', 'ccc bbb'))

Output:

learning TEL
bbb ccc
aaa bbb ccc
bbb

More about itertools.combinations
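
One caveat: itertools.combinations also yields words that are not next to each other in string1, and in string2 is a substring test, so best_combination('aaa bbb ccc', 'aaa ccc') returns 'aaa ccc' even though those two words are not consecutive in the first string. If only consecutive words should count, a minimal sketch that compares contiguous word runs could look like this (the function name longest_common_word_run is made up for illustration and is not part of the code above):

```python
def longest_common_word_run(string1, string2):
    """Return the longest run of consecutive words shared by both strings."""
    words1 = string1.split()
    words2 = string2.split()

    # Try the longest runs first, so the first hit is the best one
    for size in range(min(len(words1), len(words2)), 0, -1):
        # Every contiguous run of `size` words in string2, stored as tuples
        runs2 = {tuple(words2[j:j + size])
                 for j in range(len(words2) - size + 1)}
        for i in range(len(words1) - size + 1):
            run = tuple(words1[i:i + size])
            if run in runs2:
                return ' '.join(run)

    return ''  # no word in common


print(longest_common_word_run('technology learning TEL', 'learning TEL approach'))
print(longest_common_word_run('lightweight web applications', 'software web applications'))
```

Because whole words are compared, a partial word such as learn would not match learning here.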

Edit

Same thing in fewer lines, with more one-liners:

def best_combination(string1, string2):
    '''
    Gives best words combinations within both strings
    '''
    words = string1.split()
    solutions = []
    tests = sum([list(itertools.combinations(words, i)) for i in range(1, len(words) + 1)], [])
    for test in tests:
        if ' '.join(test) in string2:
            solutions.append(' '.join(test))
    solutions.sort(key=len, reverse=True)
    # Return '' instead of raising IndexError when nothing matches
    return solutions[0] if solutions else ''
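
As a possible alternative from the standard library, difflib.SequenceMatcher can find the longest matching block between two sequences, and it accepts lists of words as well as strings. The following is only a sketch under that assumption; the wrapper name common_words_difflib is hypothetical, and no timing was done to confirm that it is faster:

```python
from difflib import SequenceMatcher


def common_words_difflib(string1, string2):
    """Longest block of consecutive words present in both strings."""
    words1 = string1.split()
    words2 = string2.split()

    # autojunk=False turns off the popularity heuristic for long sequences
    matcher = SequenceMatcher(None, words1, words2, autojunk=False)
    match = matcher.find_longest_match(0, len(words1), 0, len(words2))

    # match.a is the start index in words1, match.size the number of words
    return ' '.join(words1[match.a:match.a + match.size])


print(common_words_difflib('technology learning TEL', 'learning TEL approach'))
```

When the two strings share no word, match.size is 0 and the function returns an empty string.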

Answer 2

I used this approach and it works:

def AnalyzeTwoExpr(expr1,  expr2): #Case sensitive
    commonExpr = []
    a = expr1.split(' ') #splits each expression into an array of words
    b = expr2.split(' ') #splits each expression into an array of words
    for word1 in a:
        for word2 in b:
            if(word1 == word2):
                commonExpr.append(word1)

    return commonExpr

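This function returns a list containing all the words that appear in both expressions. It takes two required parameters, both strings: the two expressions to analyze.

There is also a case-insensitive version: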
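
The nested loops compare every word of one expression with every word of the other. A set-based variant, sketched below, does one membership test per word instead. The name common_words is made up for illustration, and unlike the loops above this version reports each shared word only once:

```python
def common_words(expr1, expr2):
    """Words found in both expressions, in the order they occur in expr1."""
    words2 = set(expr2.split())  # set gives constant-time membership tests
    seen = set()
    result = []
    for word in expr1.split():
        # Keep a word only if expr2 has it and it was not reported already
        if word in words2 and word not in seen:
            seen.add(word)
            result.append(word)
    return result


print(common_words('technology learning TEL', 'learning TEL approach'))
```

Like the original, this finds shared words wherever they occur and does not check that they are consecutive.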

def AnalyzeTwoExpr(expr1,  expr2): #Not case sensitive
    commonExpr = []
    a = expr1.split(" ")
    b = expr2.split(" ")
    for word1 in a:
        for word2 in b:
            w1 = word1.lower()
            w2 = word2.lower()
            if(w1 == w2):
                commonExpr.append(w1)

    return commonExpr

Hope this works for you.
