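Given two phrases, I want to identify the part they have in common.
For example, given the two phrases "technology learning TEL" and "learning TEL approach", I want to identify the common term "learning TEL".
As another example, for "lightweight web applications" and "software web applications", the common term is "web applications".
My current code uses in, as shown below.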
for item1 in mylist_1:
    for item2 in mylist_2:
        if item2 in item1:
            tmp_mylist1.append(item2)
            break
print(tmp_mylist1)
However, it cannot identify the implicit common phrases I mentioned in the examples above. For instance, the following prints "no":
if "technology learning TEL" in "learning TEL approach":
    print("done")
else:
    print("no")
So, is there a fast way in Python to identify these implicit common runs of consecutive words?
Answer 0 (score: 2)
There are certainly faster ways, but since nobody else has replied yet, here is one solution:
import itertools

def best_combination(string1, string2):
    '''
    Gives the best word combination found within both strings
    '''
    words = string1.split()
    # All possible solutions for a case
    solutions = []
    # Loop to increment the number of words per combination to test
    for i in range(1, len(words) + 1):
        # Get all possible combinations for the current number of words
        possibilities = list(itertools.combinations(words, i))
        # Test all possibilities
        for possibility in possibilities:
            tested_string = ' '.join(possibility)
            # If it matches, add it to the solutions list
            if tested_string in string2:
                solutions.append(tested_string)
    # The best solution is the longest one
    solutions.sort(key=len, reverse=True)
    # Return an empty string instead of raising IndexError when nothing matches
    return solutions[0] if solutions else ''

print(best_combination('technology learning TEL', 'learning TEL approach'))
print(best_combination('aaa bbb ccc', 'bbb ccc'))
print(best_combination('aaa bbb ccc', 'aaa bbb ccc'))
print(best_combination('aaa bbb ccc', 'ccc bbb'))
Output:
learning TEL
bbb ccc
aaa bbb ccc
bbb
More about itertools.combinations
Edit
The same thing in fewer lines, closer to a one-liner:
def best_combination(string1, string2):
    '''
    Gives the best word combination found within both strings
    '''
    words = string1.split()
    solutions = []
    # Flatten the combinations of every size into a single list
    tests = sum([list(itertools.combinations(words, i)) for i in range(1, len(words) + 1)], [])
    for test in tests:
        if ' '.join(test) in string2:
            solutions.append(' '.join(test))
    solutions.sort(key=len, reverse=True)
    # Return an empty string instead of raising IndexError when nothing matches
    return solutions[0] if solutions else ''
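One caveat with the approach above: itertools.combinations also produces words that are not adjacent in string1, and the in test compares characters, so a candidate can match inside a longer word. Below is a minimal sketch of an alternative, assuming the phrases are whitespace-separated words. It runs the standard library's difflib.SequenceMatcher on lists of words instead of on characters. The function name longest_common_phrase is made up for illustration.

```python
from difflib import SequenceMatcher


def longest_common_phrase(phrase1, phrase2):
    """Return the longest run of consecutive words shared by both phrases."""
    words1 = phrase1.split()
    words2 = phrase2.split()
    # Compare lists of words rather than characters, so only whole words match.
    matcher = SequenceMatcher(None, words1, words2, autojunk=False)
    match = matcher.find_longest_match(0, len(words1), 0, len(words2))
    # match.size is 0 when the phrases share no word; the join then gives ''.
    return ' '.join(words1[match.a:match.a + match.size])


print(longest_common_phrase('technology learning TEL', 'learning TEL approach'))
print(longest_common_phrase('lightweight web applications', 'software web applications'))
```

For the two examples from the question this should print learning TEL and web applications. When several runs share the maximum length, only one of them is returned.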
Answer 1 (score: 0)
I used this approach and it works:
def AnalyzeTwoExpr(expr1, expr2):  # Case sensitive
    commonExpr = []
    a = expr1.split(' ')  # splits the first expression into a list of words
    b = expr2.split(' ')  # splits the second expression into a list of words
    for word1 in a:
        for word2 in b:
            if word1 == word2:
                commonExpr.append(word1)
    return commonExpr
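This function returns a list of all the words that appear in both expressions. It takes two required parameters, both strings: the two expressions to analyze.
There is also a case-insensitive version: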
def AnalyzeTwoExpr(expr1, expr2):  # Not case sensitive
    commonExpr = []
    a = expr1.split(" ")
    b = expr2.split(" ")
    for word1 in a:
        for word2 in b:
            # Compare lowercase copies so that case differences are ignored
            w1 = word1.lower()
            w2 = word2.lower()
            if w1 == w2:
                commonExpr.append(w1)
    return commonExpr
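Both versions above compare every pair of words, so a word that occurs several times is appended several times, and the result does not say whether the words are adjacent. As a rough sketch, assuming whitespace-separated words, a single function with a set lookup could cover both the case-sensitive and the case-insensitive variant. The name common_words and the case_sensitive parameter are hypothetical and not part of the code above. Unlike the code above, it drops duplicates and keeps the original casing from expr1.

```python
def common_words(expr1, expr2, case_sensitive=True):
    """Return the words of expr1 that also occur in expr2, in order, without duplicates."""
    # Identity when case matters, casefold otherwise.
    normalize = (lambda w: w) if case_sensitive else str.casefold
    # A set makes each membership test constant time on average.
    words2 = {normalize(w) for w in expr2.split()}
    seen = set()
    common = []
    for word in expr1.split():
        key = normalize(word)
        if key in words2 and key not in seen:
            seen.add(key)
            common.append(word)
    return common


print(common_words('technology learning TEL', 'learning TEL approach'))
print(common_words('Web Applications', 'web applications', case_sensitive=False))
```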
Hope this works for you.