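I have two columns in my dataset, as shown below. From each group of equivalent combinations I want to keep only one. Here, (orange, fruit) and (fruit, orange) are equivalent, so I only need one of them. Also, since fruit is already mapped to orange, I no longer need any rows keyed on fruit, so (fruit, red) effectively becomes (orange, red).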
C1 C2
orange fruit
orange color
orange apple
apple red
apple fruit
fruit red
fruit apple
fruit mango
fruit orange
Here is the code I tried in Python:
# Convert data frame to set of tuples
l = []
for i, x in df.iterrows():
    l.append((x['C1'], x['C2']))
s_comb = set(l)
# Set of unique values from C1
s = set(list(df['C1']))
# Initialize x with first element of s
x = list(df['C1'])[0]
x = [x]
# Code for creating combinations
for i in s:
    if i not in x:
        for j in x:
            if (i, j) not in s_comb:
                x.append(i)
Expected output:
C1 C2
orange fruit
orange color
orange apple
orange red
orange mango
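The current code takes a very long time to run, and I am not sure its output is correct.
Answer 0: (score: 4)
For the first part of the problem, you can do the following: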
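# Build an order-insensitive key per row; a sorted tuple is stable, whereas str(set(...)) depends on set iteration order
df['C'] = df.apply(lambda x: tuple(sorted(x[['C1', 'C2']])), axis=1)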
df = df.drop_duplicates(subset='C')[['C1', 'C2']]
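To make the first step concrete, here is a minimal sketch of the same order-insensitive de-duplication in plain Python. It assumes the two columns have been pulled out as a list of tuples (for example with `list(zip(df['C1'], df['C2']))`); the helper name `unique_unordered` is made up for illustration.

```python
def unique_unordered(pairs):
    """Keep the first occurrence of each pair, ignoring element order."""
    seen = set()
    kept = []
    for a, b in pairs:
        key = frozenset((a, b))  # (a, b) and (b, a) produce the same key
        if key not in seen:
            seen.add(key)
            kept.append((a, b))
    return kept


# Sample data from the question, in row order.
pairs = [
    ("orange", "fruit"), ("orange", "color"), ("orange", "apple"),
    ("apple", "red"), ("apple", "fruit"), ("fruit", "red"),
    ("fruit", "apple"), ("fruit", "mango"), ("fruit", "orange"),
]

deduped = unique_unordered(pairs)
for pair in deduped:
    print(pair)
```

On the sample data this drops (fruit, apple) and (fruit, orange), because their mirrored pairs appear earlier, and keeps the other seven rows.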
For the second part, you can do something similar:
df['Cmin'] = df.apply(lambda x: min(x[['C1', 'C2']]), axis=1)
df = df.drop_duplicates(subset='Cmin')[['C1', 'C2']]
df['Cmax'] = df.apply(lambda x: max(x[['C1', 'C2']]), axis=1)
df = df.drop_duplicates(subset='Cmax')[['C1', 'C2']]
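On the sample data, de-duplicating on the row-wise minimum and then the maximum does not seem to reach the five rows of the expected output. As a hedged alternative, the sketch below uses a different technique: it treats each row as an undirected link and groups values with a small union-find structure, mapping every value to the first-seen value of its group. This assumes the mapping is meant to be transitive, which is one reading of the question that matches the expected output; `collapse_pairs` is a hypothetical helper name.

```python
def collapse_pairs(pairs):
    """Return (representative, member) rows, one per non-representative value."""
    parent = {}      # union-find forest: value -> parent value
    first_seen = {}  # value -> index of its first appearance

    def find(v):
        # Register unseen values as their own root.
        if v not in parent:
            parent[v] = v
            first_seen[v] = len(first_seen)
        # Walk up to the root, shortening the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        # Keep the earlier-seen root as the group's representative.
        if first_seen[rb] < first_seen[ra]:
            ra, rb = rb, ra
        parent[rb] = ra

    # Dicts preserve insertion order, so rows follow first appearance.
    return [(find(v), v) for v in list(parent) if find(v) != v]


# Sample data from the question, in row order.
pairs = [
    ("orange", "fruit"), ("orange", "color"), ("orange", "apple"),
    ("apple", "red"), ("apple", "fruit"), ("fruit", "red"),
    ("fruit", "apple"), ("fruit", "mango"), ("fruit", "orange"),
]

result = collapse_pairs(pairs)
for row in result:
    print(row)
```

If a DataFrame is needed afterwards, the rows can be wrapped again with `pd.DataFrame(result, columns=['C1', 'C2'])`. Each row is visited once and the union-find lookups are close to constant time, so this should scale far better than nested loops over the unique values.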