I have this dataset:
import numpy as np
import pandas as pd
from itertools import product
A = ['ABC', 'DEF']
M = ['X', 'Y', 'Z']
F = ['plus', 'minus', 'star']
# Create every possible combination of <A, M, F> (Cartesian product)
df = pd.DataFrame(list(product(A,M,F)), columns=['A', 'M', 'F'])
df['value'] = np.random.uniform(0, 1, df.shape[0])
The dataset looks like this:
     A  M      F     value
0  ABC  X   plus  0.666602
1  ABC  X  minus  0.716765
2  ABC  X   star  0.032931
3  ABC  Y   plus  0.275616
4  ABC  Y  minus  0.489233
From this I want the top-k combinations that maximize my objective:
My goal: maximize sum(values of the sets in a combination) + sum(pairwise distances between the sets in that combination)
Here is my code:
# diversity/distance function
def diversity(a, b):
    c = a.intersection(b)
    d = float(len(c)) / (len(a) + len(b) - len(c))
    return 1 - d
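As a quick sanity check, the function above is the Jaccard distance (1 minus intersection over union). A minimal sketch with three hand-picked rows, assuming each row is represented as the set of its three labels; the names `row_1`, `row_2`, and `row_3` are only for illustration:

```python
def diversity(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    c = a.intersection(b)
    d = float(len(c)) / (len(a) + len(b) - len(c))
    return 1 - d

row_1 = {'ABC', 'X', 'plus'}
row_2 = {'ABC', 'Y', 'minus'}   # shares only 'ABC' with row_1
row_3 = {'DEF', 'Z', 'star'}    # shares nothing with row_1

d_12 = diversity(row_1, row_2)  # 1 - 1/5
d_13 = diversity(row_1, row_3)  # 1 - 0/6
d_11 = diversity(row_1, row_1)  # identical sets, so distance 0

print(d_12, d_13, d_11)
```

Three rows with these pairwise distances (0.8, 1.0, 1.0) would give the 2.8 total that appears in the output further down.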
The rest of my code:
from itertools import combinations
k = 3
max_distance = []
# I drop the 'value' column because the sets I want to compare are <A, M, F>
df_distance = df.drop(['value'], axis=1)
series_set = df_distance.apply(lambda row: set(row), axis=1)
data = series_set
for z in combinations(data, k):
    dis = 0
    sum_values = 0
    for a in combinations(z, 2):
        dis += diversity(*a)
        # I am stuck here: I want to sum the values, but I don't know how to look them up for the rows in a combination
    max_distance.append((dis, tuple(z)))
max_distance.sort(key=lambda x: x[0], reverse=True)
print(max_distance[:k])
Output:
[(2.8, ({'plus', 'ABC', 'X'}, {'Y', 'minus', 'ABC'}, {'Z', 'star', 'DEF'})), (2.8, ({'plus', 'ABC', 'X'}, {'Y', 'star', 'ABC'}, {'Z', 'minus', 'DEF'})), (2.8, ({'plus', 'ABC', 'X'}, {'Z', 'minus', 'ABC'}, {'Y', 'star', 'DEF'}))]
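In the code above I only compute the sum of the distances; the value 2.8 is just that sum. I want to sum the distances between the sets, computed only from the columns [A, M, F], and I also want to sum the values. The expected output is the best combinations ranked by (sum of all distances + sum of values).
I am really stuck on how to sum the values within a combination.
Expected output: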
[(sum(distance) + sum(values), ({'plus', 'ABC', 'X'}, {'Y', 'minus', 'ABC'}, {'Z', 'star', 'DEF'})), (sum(distance) + sum(values), ({'plus', 'ABC', 'X'}, {'Y', 'star', 'ABC'}, {'Z', 'minus', 'DEF'})), (sum(distance) + sum(values), ({'plus', 'ABC', 'X'}, {'Z', 'minus', 'ABC'}, {'Y', 'star', 'DEF'}))]
Let me know if anything is unclear, and sorry for my English.
Answer 0 (score: 1)
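See the slightly modified version of your code below. I think it does what you want. I basically moved the cast to set into the diversity function, so that each element of series_set can be a tuple. That tuple can then be used to look up rows in a DataFrame indexed by a MultiIndex.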
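To illustrate the lookup idea on its own, here is a minimal sketch of indexing a MultiIndex DataFrame with a full (A, M, F) tuple. The fixed `value` column and the names `key` and `looked_up` are assumptions made for the example, replacing the random values from the question so the result is predictable:

```python
import pandas as pd
from itertools import product

A = ['ABC', 'DEF']
M = ['X', 'Y', 'Z']
F = ['plus', 'minus', 'star']

df = pd.DataFrame(list(product(A, M, F)), columns=['A', 'M', 'F'])
# Fixed illustrative values (0.00, 0.01, 0.02, ...) instead of random ones.
df['value'] = [i / 100 for i in range(len(df))]

# Move the three label columns into a MultiIndex.
df_sum = df.set_index(['A', 'M', 'F'])

# A full tuple selects exactly one row; adding the column name yields a scalar.
key = ('ABC', 'X', 'minus')
looked_up = df_sum.loc[key, 'value']
print(looked_up)
```

Because a tuple is hashable and a set is not, keeping the rows as tuples is what makes this kind of lookup possible.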
import numpy as np
import pandas as pd
from itertools import product, combinations
A = ['ABC', 'DEF']
M = ['X', 'Y', 'Z']
F = ['plus', 'minus', 'star']
# Create every possible combination of <A, M, F> (Cartesian product)
df = pd.DataFrame(list(product(A, M, F)), columns=['A', 'M', 'F'])
df['value'] = np.random.uniform(0, 1, df.shape[0])
# diversity/distance function
def diversity(a, b):
    # a and b are tuples here, so cast to set before intersecting
    c = set(a).intersection(b)
    d = float(len(c)) / (len(a) + len(b) - len(c))
    return 1 - d
k = 3
max_distance = []
max_values = []
# I drop the 'value' column because the sets I want to compare are <A, M, F>
df_distance = df.drop(['value'], axis=1)
# Index by (A, M, F) so a row tuple can be used to look up its value
df_sum = df.set_index(['A', 'M', 'F'])
series_set = df_distance.apply(lambda row: tuple(row), axis=1)
data = series_set
for z in combinations(data, k):
    dis = 0
    sum_values = 0
    for a in combinations(z, 2):
        dis += diversity(*a)
        # .loc replaces the deprecated .ix indexer, which was removed in pandas 1.0
        sum_values += df_sum.loc[a[0], 'value'] + df_sum.loc[a[1], 'value']
    max_distance.append((dis, tuple(z)))
    max_values.append((sum_values, tuple(z)))
max_distance.sort(key=lambda x: x[0], reverse=True)
print(max_distance[:k])
max_values.sort(key=lambda x: x[0], reverse=True)
print(max_values[:k])
- Update -
max_total = []
for z in combinations(data, k):
    dis = 0
    sum_values = 0
    for a in combinations(z, 2):
        dis += diversity(*a)
        sum_values += df_sum.loc[a[0], 'value'] + df_sum.loc[a[1], 'value']
    total_sum = dis + sum_values
    max_total.append((total_sum, tuple(z)))
max_total.sort(key=lambda x: x[0], reverse=True)
print(max_total[:k])
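Note that the loop above adds both rows' values for every pair, so with k = 3 each row's value enters the sum twice (k - 1 times in general). If the objective is meant to count each value once, a minimal sketch could look like the following. The names `value_of` and `score`, the fixed random seed, and the use of `heapq.nlargest` instead of a full sort are choices made here for illustration, not part of the original answer:

```python
import heapq
from itertools import combinations, product

import numpy as np
import pandas as pd

A = ['ABC', 'DEF']
M = ['X', 'Y', 'Z']
F = ['plus', 'minus', 'star']

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility
df = pd.DataFrame(list(product(A, M, F)), columns=['A', 'M', 'F'])
df['value'] = rng.uniform(0, 1, len(df))

def diversity(a, b):
    """Jaccard distance between two label tuples."""
    c = set(a).intersection(b)
    return 1 - len(c) / (len(a) + len(b) - len(c))

# Map each (A, M, F) tuple to its value for direct lookup.
value_of = {(a, m, f): v for a, m, f, v in df.itertuples(index=False, name=None)}

def score(combo):
    """Sum of pairwise distances plus the sum of values, each row counted once."""
    distance = sum(diversity(a, b) for a, b in combinations(combo, 2))
    values = sum(value_of[row] for row in combo)
    return distance + values

k = 3
# nlargest keeps only the k best combinations instead of sorting all of them.
top_k = heapq.nlargest(k, combinations(value_of, k), key=score)
for combo in top_k:
    print(round(score(combo), 4), combo)
```

Both approaches enumerate every combination, so they only scale to small inputs; `heapq.nlargest` just avoids storing and sorting the full list.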