Detecting common consecutive words in Python

Time: 2017-12-28 10:46:47

Tags: python

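Given two phrases, I want to identify the part they have in common.

For example, given the two phrases "technology learning TEL" and "learning TEL approach", I want to identify the common term learning TEL.

As another example, for "lightweight web applications" and "software web applications", the common term is web applications.

My current code uses in, as shown below.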

for item1 in mylist_1:
    for item2 in mylist_2:
        if item2 in item1:
            tmp_mylist1.append(item2)
            break

print(tmp_mylist1)

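However, it cannot identify partially overlapping phrases like the ones in my examples above, because in only succeeds when one whole string is contained in the other.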

if "technology learning TEL" in "learning TEL approach":
    print("done")
else:
    print("no")

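So, what is the fastest way to identify these overlapping common consecutive words in Python?

2 answers:

Answer 0: (score: 2)

There is surely a faster way, but since nobody has replied yet, here is one solution: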

import itertools

def best_combination(string1, string2):
    '''
    Gives best words combinations within both strings
    '''
    words = string1.split()
    # All possible solutions for a case
    solutions = []

    # Try every group size, from 1 word up to all of the words
    for i in range(1, len(words) + 1):
        # Get all possible combinations for the current number of words
        possibilities = list(itertools.combinations(words, i))

        # Test all possibilities
        for possibility in possibilities:
            tested_string = ' '.join(possibility)

            # If it matches, add it to the solutions list
            if tested_string in string2:
                solutions.append(tested_string)

    # Best solution is the longest; return '' when nothing matches
    return max(solutions, key=len, default='')


print(best_combination('technology learning TEL', 'learning TEL approach'))
print(best_combination('aaa bbb ccc', 'bbb ccc'))
print(best_combination('aaa bbb ccc', 'aaa bbb ccc'))
print(best_combination('aaa bbb ccc', 'ccc bbb'))

Output:

learning TEL
bbb ccc
aaa bbb ccc
bbb

More about itertools.combinations
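
For reference, here is a small sketch of what itertools.combinations yields for the first example. It keeps the original word order, but it also produces non-adjacent groups such as ('technology', 'TEL'). So a phrase returned by best_combination is contiguous in string2 (because of the in check) but not necessarily in string1, and the number of candidates grows as 2**n - 1 for n words.

```python
import itertools

words = 'technology learning TEL'.split()

# All 2-word combinations, in the order the words appear in the string
pairs = list(itertools.combinations(words, 2))
print(pairs)

# Total number of non-empty combinations across all sizes: 2**n - 1
total = sum(1 for i in range(1, len(words) + 1)
            for _ in itertools.combinations(words, i))
print(total)
```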

Edit

Same thing in fewer lines, closer to a one-liner:

def best_combination(string1, string2):
    '''
    Gives best words combinations within both strings
    '''
    words = string1.split()
    solutions = []

    tests = sum([list(itertools.combinations(words, i)) for i in range(1, len(words) + 1)], [])
    for test in tests:
        if ' '.join(test) in string2:
            solutions.append(' '.join(test))
    # The longest match wins; return '' when nothing matches
    return max(solutions, key=len, default='')
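
As an alternative sketch (not part of the code above), assuming "common part" means the longest run of whole words that is contiguous in both strings, the standard library's difflib.SequenceMatcher.find_longest_match can be applied to the two word lists. The function name longest_common_phrase is hypothetical.

```python
from difflib import SequenceMatcher


def longest_common_phrase(string1, string2):
    """Return the longest run of words contiguous in both strings ('' if none)."""
    words1 = string1.split()
    words2 = string2.split()
    # autojunk=False turns off the heuristic that ignores very frequent items
    matcher = SequenceMatcher(None, words1, words2, autojunk=False)
    match = matcher.find_longest_match(0, len(words1), 0, len(words2))
    # match.a is the start index in words1, match.size the number of words
    return ' '.join(words1[match.a:match.a + match.size])


print(longest_common_phrase('technology learning TEL', 'learning TEL approach'))
print(longest_common_phrase('lightweight web applications', 'software web applications'))
```

Because this compares whole words instead of substrings, TEL would not match inside a longer word such as HOTEL, and it avoids enumerating every combination of words.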

Answer 1: (score: 0)

I used this approach and it works:

def AnalyzeTwoExpr(expr1, expr2):  # Case sensitive
    commonExpr = []
    a = expr1.split(' ')  # split each expression into a list of words
    b = expr2.split(' ')
    for word1 in a:
        for word2 in b:
            if word1 == word2:
                commonExpr.append(word1)

    return commonExpr

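This function returns a list containing all the words that appear in both expressions. It takes two required parameters, both strings: the two expressions to analyze.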
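
The same idea can be sketched with a set lookup, which avoids the nested loop and does not repeat a word once per match. The name common_words is hypothetical. Like the function above, it finds individual shared words and does not check that they are adjacent.

```python
def common_words(expr1, expr2):
    """Return the words of expr1 that also occur in expr2, in order, without repeats."""
    words2 = set(expr2.split())
    seen = set()
    result = []
    for word in expr1.split():
        # Keep a word only the first time it is seen in expr1
        if word in words2 and word not in seen:
            seen.add(word)
            result.append(word)
    return result


print(common_words('technology learning TEL', 'learning TEL approach'))
```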

There is also a case-insensitive version:

def AnalyzeTwoExpr(expr1, expr2):  # Not case sensitive
    commonExpr = []
    a = expr1.split(" ")
    b = expr2.split(" ")
    for word1 in a:
        for word2 in b:
            # Compare lowercased copies so that "TEL" matches "tel"
            w1 = word1.lower()
            w2 = word2.lower()
            if w1 == w2:
                commonExpr.append(w1)

    return commonExpr

Hope this works for you.