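I have one DataFrame and want to return the difference between the values of two of its columns, but only when the second column has a value. In the example below, the second column is AppliancesO and the first is AppliancesH.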
Item  Name  AppliancesH            AppliancesO
1     Joe   TV                     TV
2     Mary  [TV; Fridge]           TV
3     Jack  [Microwave;TV;Fridge]  [Computer;TV;Fridge]
4     Pete  [Fridge;Oven]
plus 1000+ more rows.
The output I'm looking for is:
Item  Name  AppliancesH            AppliancesO           Diff
1     Joe   TV                     TV
2     Mary  [TV; Fridge]           TV                    Fridge
3     Jack  [Microwave;TV;Fridge]  [Computer;TV;Fridge]  [Microwave;Computer]
4     Pete  [Fridge;Oven]
I know how to compare the columns to determine whether they differ, but I don't know how to return the difference:
df.loc[(df['AppliancesH']!=df['AppliancesO'])& ~df.AppliancesO.isna()][['Name','AppliancesH', 'AppliancesO','Diff']]
Answer 0 (score: 1)
Assuming the following data:
>>> import pandas as pd
>>> dict_ = {'AppliancesH': {1: ['TV'], 2: ['TV', 'Fridge'], 3: ['Microwave', 'TV', 'Fridge'], 4: ['Fridge', 'Oven']}, 'AppliancesO': {1: ['TV'], 2: ['TV'], 3: ['Computer', 'TV', 'Fridge'], 4: []}, 'Name': {1: 'Joe', 2: 'Mary', 3: 'Jack', 4: 'Pete'}}
>>> df = pd.DataFrame(dict_)
>>> df
               AppliancesH             AppliancesO  Name
1                     [TV]                    [TV]   Joe
2             [TV, Fridge]                    [TV]  Mary
3  [Microwave, TV, Fridge]  [Computer, TV, Fridge]  Jack
4           [Fridge, Oven]                      []  Pete
You can use set.symmetric_difference for this kind of operation. First, let's define the callable we need:
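def symdif(s: pd.Series) -> list:
    h = s.AppliancesH
    o = s.AppliancesO
    # `and` short-circuits: an empty h or o is returned as-is (an empty list),
    # so the difference is only computed when both columns have values.
    return h and o and sorted(set(h).symmetric_difference(o))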
Using it:
>>> df['Diff'] = df.apply(axis=1, func=symdif)
>>> df
               AppliancesH             AppliancesO  Name                   Diff
1                     [TV]                    [TV]   Joe                     []
2             [TV, Fridge]                    [TV]  Mary               [Fridge]
3  [Microwave, TV, Fridge]  [Computer, TV, Fridge]  Jack  [Computer, Microwave]
4           [Fridge, Oven]                      []  Pete                     []
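To see why the Pete row comes out empty, it may help to run the same expression on plain lists, outside pandas. This is only a minimal sketch: the helper name symdif_lists is hypothetical and simply mirrors the return line of symdif above.

```python
def symdif_lists(h: list, o: list) -> list:
    """Mirror of the return expression in symdif, on plain lists."""
    # `and` returns the first falsy operand, so an empty list on either
    # side is returned unchanged and the set logic is skipped.
    return h and o and sorted(set(h).symmetric_difference(o))


print(symdif_lists(["TV", "Fridge"], ["TV"]))   # ['Fridge']
print(symdif_lists(["Microwave", "TV", "Fridge"],
                   ["Computer", "TV", "Fridge"]))  # ['Computer', 'Microwave']
print(symdif_lists(["Fridge", "Oven"], []))     # []
```

Note that this relies on a missing value being an empty list. If the column holds NaN instead, the expression would not short-circuit (NaN is truthy) and the set call would raise a TypeError, so such cells would need to be handled first.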
Answer 1 (score: 1)
Here is another way:
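# Select the two list columns explicitly so the result keeps df's index
# (on pandas >= 2.1, DataFrame.map replaces the deprecated applymap).
cols = ['AppliancesH', 'AppliancesO']
df['Differences'] = (df[cols]
                     .applymap(set)
                     .apply(lambda x: set.symmetric_difference(*x), axis=1)
                     .map(list))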
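Answer 2 (score: 0)
This can also be done with the XOR operator:
def find_diff(row):
    # Report no difference when any cell in the row is missing.
    if row.isna().any():
        return []
    # ^ on two sets is the symmetric difference.
    diff = set(row['AppliancesH']) ^ set(row['AppliancesO'])
    return list(diff)

df.apply(find_diff, axis=1)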
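As a small illustration on plain Python sets, the ^ operator and the symmetric_difference method give the same result. The one difference worth knowing is that the operator needs a set on both sides, while the method accepts any iterable. The variable names below are only for the example.

```python
h = ["Microwave", "TV", "Fridge"]
o = ["Computer", "TV", "Fridge"]

via_operator = set(h) ^ set(o)               # both operands must be sets
via_method = set(h).symmetric_difference(o)  # the argument may be any iterable

print(sorted(via_operator))        # ['Computer', 'Microwave']
print(via_operator == via_method)  # True

try:
    set(h) ^ o  # a set XOR a list is not supported
except TypeError as exc:
    print("TypeError:", exc)
```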
You may also need to write a function that converts those strings into lists.
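One possible shape for such a conversion function is sketched below. It assumes the cells look like the samples in the question (a bare name such as TV, or a bracketed list separated by semicolons) and that missing cells are None or NaN. The name to_list is hypothetical.

```python
import math


def to_list(value) -> list:
    """Turn a cell such as '[TV; Fridge]' or 'TV' into a list of names."""
    # Treat None and float NaN (how pandas usually marks a missing cell) as empty.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return []
    # Drop optional surrounding brackets, then split on ';' and trim spaces.
    inner = str(value).strip().strip("[]")
    return [item.strip() for item in inner.split(";") if item.strip()]


print(to_list("TV"))                     # ['TV']
print(to_list("[TV; Fridge]"))           # ['TV', 'Fridge']
print(to_list("[Microwave;TV;Fridge]"))  # ['Microwave', 'TV', 'Fridge']
print(to_list(float("nan")))             # []
```

Applied to the frame, something like df['AppliancesH'] = df['AppliancesH'].map(to_list) (and the same for AppliancesO) should produce the list columns that the answers above assume.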