Mapping pairs of Python strings across two lists based on text similarity score

Time: 2018-04-17 07:57:25

Tags: python string mapping

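Can someone help me with this? I have two lists of strings that may differ in length. I need to map each string in list 'A' to one and only one string in list 'B', choosing the pair with the highest score from a text similarity measure such as cosine or Jaccard similarity.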

Here is an example:

A = ['I love in eating apple every Tuesday','I went to the bank to withdraw money','Is python a snake or a programming language']
B = ['Apple is good for your health, endeavour to eat one once a week', 'I bank with North-West bank located at apple street where I withdraw money every time','Python programming is interesting','I am a good chef and eating is my hobby']

The result I want looks like this:

{'I love in eating apple every Tuesday': 'Apple is good for your health, endeavour to eat one once a week', 'I went to the bank to withdraw money': 'I bank with North-West bank located at apple street where I withdraw money every time', 'Is python a snake or a programming language': 'Python programming is interesting'}

Note that when the lists differ in length, the string with the fewest matching words is left unmatched.

Thanks.

@Megalng What I mean is that the mapping is not being done based on the overlapping words in the matched strings.

import re, math
from collections import Counter

def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection]) 
    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

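def vector(text):
    # Counter on a raw string counts characters, not words
    return Counter(text)

result = {}
for s1 in A:
    # Pick the best remaining candidate in B, then remove it so it is used only once
    s2 = max(B, key=lambda x: get_cosine(vector(s1), vector(x)))
    B.remove(s2)
    result[s1] = s2
print(result)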

1 answer:

Answer 0 (score: 0)

So you have a function similarity(s1, s2) that returns a number? If so, you should be able to do something like this:

A = ['I love in eating apple every Tuesday', 'I went to the bank to withdraw money', 'Is python a snake or a programming language']
B = ['Apple is good for your health, endeavor to eat one once a week', 'I bank with North-West bank located at apple street where I withdraw money every time',
     'Python programming is interesting', 'I am a good chef and eating is my hobby']
result = {}
for s1 in A:
    s2 = max(B, key=lambda x:similarity(s1,x))
    B.remove(s2)
    result[s1] = s2
print(result)
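
The loop above assumes similarity already exists. As a rough sketch of one way to fill it in, the version below scores strings by cosine similarity over lowercased word counts instead of character counts. The names WORD, word_vector and map_strings are made up for illustration and are not from the question. Rather than walking A in order, it ranks every (A, B) pair by score and takes the best ones first, so the leftover string in B is one that lost out on score. This is still a greedy heuristic, not an optimal assignment, and it leaves the input lists unmodified.

```python
import math
import re
from collections import Counter

# Hypothetical helper names; only the idea of similarity(s1, s2) comes from the answer.
WORD = re.compile(r"\w+")


def word_vector(text):
    """Bag-of-words counts, lowercased so 'Apple' and 'apple' match."""
    return Counter(WORD.findall(text.lower()))


def similarity(s1, s2):
    """Cosine similarity between the word-count vectors of two strings."""
    v1, v2 = word_vector(s1), word_vector(s2)
    numerator = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    denominator = norm1 * norm2
    return numerator / denominator if denominator else 0.0


def map_strings(a, b):
    """Pair each string in `a` with at most one string in `b`.

    Every (a, b) pair is scored, the pairs are sorted from best to worst,
    and a pair is accepted only if neither side has been used yet.
    """
    pairs = sorted(
        ((similarity(s1, s2), i, j)
         for i, s1 in enumerate(a)
         for j, s2 in enumerate(b)),
        reverse=True,
    )
    result, used_a, used_b = {}, set(), set()
    for _score, i, j in pairs:
        if i not in used_a and j not in used_b:
            result[a[i]] = b[j]
            used_a.add(i)
            used_b.add(j)
    return result


if __name__ == "__main__":
    A = ['I love in eating apple every Tuesday',
         'I went to the bank to withdraw money',
         'Is python a snake or a programming language']
    B = ['Apple is good for your health, endeavor to eat one once a week',
         'I bank with North-West bank located at apple street where I withdraw money every time',
         'Python programming is interesting',
         'I am a good chef and eating is my hobby']
    print(map_strings(A, B))
```

One caveat: plain word overlap does not know that "eat" and "eating" are related, so on these lists the first sentence scores higher against the "chef" sentence (which shares "I" and "eating") than against the "Apple" one. Getting the exact mapping from the question would probably need stemming or a semantic similarity measure plugged into similarity.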