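Can someone help me with this problem? I have two lists of strings, possibly of different lengths. I need to map each string in list 'A' to one and only one string in list 'B', using the maximum score from a text similarity method such as cosine or Jaccard similarity.
Here is an example: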
A = ['I love in eating apple every Tuesday','I went to the bank to withdraw money','Is python a snake or a programming language']
B = ['Apple is good for your health, endeavour to eat one once a week', 'I bank with North-West bank located at apple street where I withdraw money every time','Python programming is interesting','I am a good chef and eating is my hobby']
The result I want is:
{'I love in eating apple every Tuesday': 'Apple is good for your health, endeavour to eat one once a week', 'I went to the bank to withdraw money': 'I bank with North-West bank located at apple street where I withdraw money every time', 'Is python a snake or a programming language': 'Python programming is interesting'}
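Note that when the lists differ in length, the string with the fewest matching words is left unmatched.
Thanks.
@Megalng What I mean is that the mapping is not done based on the overlapping words in the matched strings.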
import re
import math
from collections import Counter

WORD = re.compile(r"\w+")

def get_cosine(vec1, vec2):
    # Cosine similarity between two word-count vectors (Counter objects)
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)
    sum1 = sum(vec1[x] ** 2 for x in vec1.keys())
    sum2 = sum(vec2[x] ** 2 for x in vec2.keys())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    return float(numerator) / denominator

def vector(text):
    # Count lowercase word tokens; Counter(text) alone would count characters
    return Counter(WORD.findall(text.lower()))

result = {}
for s1 in A:
    # Pick the remaining string in B that scores highest against s1
    s2 = max(B, key=lambda x: get_cosine(vector(s1), vector(x)))
    B.remove(s2)
    result[s1] = s2
print(result)
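The question also mentions Jaccard similarity. As a rough sketch only (the name jaccard and the word-level tokenizer are assumptions, not part of the original code), a word-set version could look like this:

```python
import re

WORD = re.compile(r"\w+")  # assumed tokenizer: runs of word characters

def jaccard(text1, text2):
    """Jaccard similarity of the two texts' lowercase word sets."""
    set1 = set(WORD.findall(text1.lower()))
    set2 = set(WORD.findall(text2.lower()))
    union = set1 | set2
    if not union:
        # Both texts are empty: treat them as having no similarity
        return 0.0
    return len(set1 & set2) / len(union)
```

It takes the raw strings directly, so it could stand in for get_cosine(vector(s1), vector(x)) as the key in the loop above.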
Answer 0 (score: 0)
So you have a function similarity(s1, s2) that returns a number?
If that is the case, you should be able to do something like this:
A = ['I love in eating apple every Tuesday', 'I went to the bank to withdraw money', 'Is python a snake or a programming language']
B = ['Apple is good for your health, endeavor to eat one once a week', 'I bank with North-West bank located at apple street where I withdraw money every time',
'Python programming is interesting', 'I am a good chef and eating is my hobby']
result = {}
for s1 in A:
    # Best remaining candidate in B for this string
    s2 = max(B, key=lambda x: similarity(s1, x))
    # Remove it so that no string in B is used twice
    B.remove(s2)
    result[s1] = s2
print(result)
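The loop above assigns strings in the order they appear in A, modifies B as it goes, and raises an error if A is longer than B. One possible variant, sketched here under assumptions, scores every pair first and accepts the highest-scoring pairs first, so the weakest candidates are the ones left unmatched. The similarity function below is only a hypothetical stand-in (Jaccard overlap of word sets), and match_best_first is an illustrative name. Whether it reproduces the mapping wanted in the question depends entirely on the similarity function used.

```python
import re

def similarity(s1, s2):
    # Hypothetical stand-in: Jaccard overlap of lowercase word sets
    w1 = set(re.findall(r"\w+", s1.lower()))
    w2 = set(re.findall(r"\w+", s2.lower()))
    union = w1 | w2
    return len(w1 & w2) / len(union) if union else 0.0

def match_best_first(A, B, sim=similarity):
    """Greedy one-to-one matching that accepts the best-scoring pairs first."""
    # Score every (A, B) pair once, keeping the positions of both strings
    scored = [(sim(a, b), i, j) for i, a in enumerate(A) for j, b in enumerate(B)]
    # Highest score first; ties keep their original order (stable sort)
    scored.sort(key=lambda t: t[0], reverse=True)
    result, used_a, used_b = {}, set(), set()
    for _, i, j in scored:
        if i not in used_a and j not in used_b:
            result[A[i]] = B[j]
            used_a.add(i)
            used_b.add(j)
    return result

A = ["red apple", "blue sky today"]
B = ["blue sky", "green apple", "unrelated words"]
print(match_best_first(A, B))
```

This is still a greedy heuristic rather than an optimal assignment, and it leaves B unchanged. If A is longer than B, the extra strings in A simply do not appear in the result.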