Suppose I have the following lists (in reality they contain many more sublists):
list_1 = [['Hi my name is anon'],
['Hi I like #hokey']]
list_2 = [['Hi my name is anon_2'],
['Hi I like #Basketball']]
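I want to compute the distance for every possible pair across the two lists, without repeats (combinations without replacement, or a product?). For example: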
distance between: ['Hi my name is anon'] and ['Hi my name is anon_2']
distance between: ['Hi my name is anon'] and ['Hi I like #Basketball']
distance between: ['Hi I like #hokey'] and ['Hi my name is anon_2']
distance between: ['Hi I like #hokey'] and ['Hi I like #Basketball']
and put the scores into a list like this:
[distance_1,distance_2,distance_3,distance_4]
For this I thought of using itertools product or combinations. This is what I tried:
strings_1 = [i[0] for i in list_1]
strings_2 = [i[0] for i in list_2]
import itertools
scores_list = [dis.jaccard(i,j) for i,j in zip(itertools.combinations(strings_1, strings_2))]
The problem is that I get this traceback:
scores_list = [dis.jaccard(i,j) for i,j in zip(itertools.combinations(strings_1, strings_2))]
TypeError: an integer is required
How can I do this efficiently? How do I compute this kind of product-like combination?
Answer 0 (score: 2)
You need to use itertools.product to get the Cartesian product, like this
[dis.jaccard(string1, string2) for string1, string2 in product(list_1, list_2)]
product pairs every item of the first list with every item of the second, for example
>>> from itertools import product
>>> from pprint import pprint
>>> pprint(list(product(list_1, list_2)))
[(['Hi my name is anon'], ['Hi my name is anon_2']),
(['Hi my name is anon'], ['Hi I like #Basketball']),
(['Hi I like #hokey'], ['Hi my name is anon_2']),
(['Hi I like #hokey'], ['Hi I like #Basketball'])]
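For contrast, the TypeError in the question most likely comes from the signature of itertools.combinations, which takes one iterable and an integer r (the size of each combination), not two iterables. A minimal sketch of the difference, using the flattened strings from the question:

```python
from itertools import combinations, product

strings_1 = ['Hi my name is anon', 'Hi I like #hokey']
strings_2 = ['Hi my name is anon_2', 'Hi I like #Basketball']

# combinations(iterable, r) expects an integer r, so passing a second
# list as r raises TypeError.
try:
    combinations(strings_1, strings_2)
    raised = False
except TypeError:
    raised = True

# combinations pairs items *within* a single iterable ...
within_one = list(combinations(strings_1, 2))

# ... while product pairs items *across* two iterables.
across_two = list(product(strings_1, strings_2))
```

Here within_one holds the single pair drawn from strings_1, and across_two holds the four cross-list pairs the question asks for.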
If you only want to apply the jaccard function to the strings inside the lists, you may need to flatten the lists first, for example
>>> list_11 = [item for items in list_1 for item in items]
>>> list_21 = [item for items in list_2 for item in items]
>>> pprint([str1 + " " + str2 for str1, str2 in product(list_11, list_21)])
['Hi my name is anon Hi my name is anon_2',
'Hi my name is anon Hi I like #Basketball',
'Hi I like #hokey Hi my name is anon_2',
'Hi I like #hokey Hi I like #Basketball']
>>> pprint([dis.jaccard(str1, str2) for str1, str2 in product(list_11, list_21)])
...
...
As Ashwini suggested in the comments, in your case you can use itertools.starmap directly, like this
>>> from itertools import product, starmap
>>> list(starmap(dis.jaccard, product(list_11, list_21)))
For example,
>>> list_1 = ["a1", "a2", "a3"]
>>> list_2 = ["b1", "b2", "b3"]
>>> from itertools import product, starmap
>>> list(starmap(lambda x, y: x + " " + y, product(list_1, list_2)))
['a1 b1', 'a1 b2', 'a1 b3', 'a2 b1', 'a2 b2', 'a2 b3', 'a3 b1', 'a3 b2', 'a3 b3']
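Putting the pieces together for the data in the question, here is a rough end-to-end sketch. The thread never shows what dis.jaccard computes, so jaccard_distance below is a hypothetical stand-in that treats each string as a set of whitespace-separated tokens; the real dis.jaccard may tokenize differently and return different numbers. Python 3 division is assumed.

```python
from itertools import product, starmap

def jaccard_distance(a, b):
    """Hypothetical stand-in for dis.jaccard:
    1 - |A & B| / |A | B| over whitespace-separated tokens."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    if not union:
        return 0.0  # two empty strings are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)

list_1 = [['Hi my name is anon'], ['Hi I like #hokey']]
list_2 = [['Hi my name is anon_2'], ['Hi I like #Basketball']]

# Pull the single string out of each sublist, as in the question.
strings_1 = [sub[0] for sub in list_1]
strings_2 = [sub[0] for sub in list_2]

# One score per (string from list_1, string from list_2) pair,
# in the same order that product yields the pairs.
scores_list = list(starmap(jaccard_distance, product(strings_1, strings_2)))
```

scores_list then has four entries, ordered like the four pairs listed in the question.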
Answer 1 (score: 1)
product works, but since you only need pairs, this works too:
[dis.jaccard(string1, string2) for string1 in list_1 for string2 in list_2]
That said, the starmap + product combination of course wins.
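As a quick check that the two approaches agree, the sketch below runs the nested comprehension and the starmap + product version on the toy lists from Answer 0. join_pair is a hypothetical placeholder for whatever distance function is actually used.

```python
from itertools import product, starmap

def join_pair(x, y):
    # Hypothetical placeholder for a real distance function.
    return x + " " + y

list_1 = ["a1", "a2", "a3"]
list_2 = ["b1", "b2", "b3"]

# Nested comprehension: the outer loop runs over list_1, the inner over list_2.
nested = [join_pair(s1, s2) for s1 in list_1 for s2 in list_2]

# starmap unpacks each pair from product into the function's arguments.
via_product = list(starmap(join_pair, product(list_1, list_2)))

same = nested == via_product
```

Both forms visit the pairs in the same order, so the results line up element by element.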