Given a pd.DataFrame, for example:
to_remove pred_0 .... pred_10
0 ['apple'] ['apple','abc'] .... ['apple','orange']
1 ['cd','sister'] ['uncle','cd'] .... ['apple']
In each row, I want to remove from pred_0 ... pred_10 every element that also appears in to_remove of the same row.
In this example, the result should be:
to_remove pred_0 .... pred_10
0 ['apple'] ['abc'].... ['orange'] # remove 'apple' this row
1 ['cd','sister'] ['uncle']....['apple'] # remove 'cd' and 'sister' this row
I would like to know how to code this.
To generate the example df:
import pandas as pd
from collections import OrderedDict
df = pd.DataFrame(OrderedDict({'to_remove': [['apple'], ['cd', 'sister']], 'pred_0': [['apple', 'abc'], ['uncle', 'cd']], 'pred_1': [['apple', 'orange'], ['apple']]}))
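Answer 0 (score: 1)
You can iterate row by row and, in every other column, keep only the elements that are not listed in to_remove.
Given the dataframe: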
pred_0 pred_10 to_remove
0 [apple, abc] [apple, orage] [apple]
1 [uncle, cd] [apple] [cd, sister]
df.apply(lambda x: x[x.index.difference(['to_remove'])].apply(lambda y: [i for i in y if i not in x['to_remove']]),1)
Out:
pred_0 pred_10
0 [abc] [orage]
1 [uncle] [apple]
Answer 1 (score: 0)
You can use a couple of list comprehensions:
s = df['to_remove'].map(set)
for col in ['pred_0', 'pred_1']:
    df[col] = [[i for i in L if i not in S] for L, S in zip(df[col], s)]
print(df)
to_remove pred_0 pred_1
0 [apple] [abc] [orange]
1 [cd, sister] [uncle] [apple]
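Since the question mentions pred_0 through pred_10, the column list does not have to be typed out by hand. Below is a minimal sketch of the same list-comprehension idea, assuming every column to clean has a name starting with 'pred_' (that naming rule is my assumption, taken from the example). The result is wrapped in a pd.Series so that each list stays a single cell.

```python
import pandas as pd

df = pd.DataFrame({
    'to_remove': [['apple'], ['cd', 'sister']],
    'pred_0': [['apple', 'abc'], ['uncle', 'cd']],
    'pred_1': [['apple', 'orange'], ['apple']],
})

# Assumption: the columns to clean are exactly those named 'pred_*'.
pred_cols = [c for c in df.columns if c.startswith('pred_')]

# Build one set per row so membership tests are cheap.
remove_sets = [set(items) for items in df['to_remove']]

for col in pred_cols:
    filtered = [
        [item for item in values if item not in banned]
        for values, banned in zip(df[col], remove_sets)
    ]
    # Wrap in a Series aligned to df.index so each list is one cell.
    df[col] = pd.Series(filtered, index=df.index)

print(df)
```

The to_remove column is left untouched, which matches the expected result in the question.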
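List comprehensions are likely more efficient than pd.DataFrame.apply, which has to construct a Series for each row and pass it to the function, and that is expensive. As you can see, your requirement does not really take advantage of Pandas / NumPy.
Therefore, unless you are able to expand your lists into a series of strings, dict + list may be a more appropriate choice of data structure.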
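To illustrate the dict + list suggestion, here is a rough sketch without pandas. The function name strip_removed and its remove_key parameter are hypothetical names chosen for this example, and the data is the sample from the question.

```python
data = {
    'to_remove': [['apple'], ['cd', 'sister']],
    'pred_0': [['apple', 'abc'], ['uncle', 'cd']],
    'pred_1': [['apple', 'orange'], ['apple']],
}


def strip_removed(table, remove_key='to_remove'):
    """Return a new dict where, row by row, the items listed under
    `remove_key` are filtered out of every other column."""
    remove_sets = [set(items) for items in table[remove_key]]
    cleaned = {remove_key: table[remove_key]}
    for key, rows in table.items():
        if key == remove_key:
            continue
        cleaned[key] = [
            [item for item in row if item not in banned]
            for row, banned in zip(rows, remove_sets)
        ]
    return cleaned


result = strip_removed(data)
print(result)
```

The input dict is not modified; the filtered columns are new lists, while the to_remove lists are shared with the input.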