I have a dataset that looks like this:
Id City Color City_1 Color_1
123 Miami Nan Miami Nan
124 Miami nan Nan Miami
125 Seattle Nan Mexico Nan
126 Nan white Nan Yellow
127 Wash Nan Wash Nan
128 LA pink LA Pink
(To recreate it):
from numpy import nan
import pandas as pd
df = pd.DataFrame.from_dict({'Id': {0: 123, 1: 124, 2: 125, 3: 126, 4: 127, 5: 128},
'City': {0: 'Miami', 1: 'Miami', 2: 'Seattle', 3: nan, 4: 'Wash', 5: 'LA'},
'Color': {0: 'Nan', 1: nan, 2: nan, 3: 'white', 4: nan, 5: 'pink'},
'City_1': {0: 'Miami', 1: nan, 2: 'Mexico', 3: nan, 4: 'Wash', 5: 'LA'},
'Color_1': {0: nan, 1: 'Miami', 2: nan, 3: 'Yellow', 4: nan, 5: 'Pink'}})
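I have to compare the columns, ignoring NaNs, and add a Same/Different column to the dataset. Later I need to output the counts of Same and Different.
The output dataset should look like this: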
Id City Color City_1 Color_1 Result
123 Miami Nan Miami Nan Same
124 Miami nan Nan Miami Different
125 Seattle Nan Mexico Nan Different
126 Nan white Nan Yellow Different
127 Wash Nan Wash Nan Same
128 LA pink LA Pink Same
I'd like to know how to do the comparison while ignoring NaNs.
Answer 0 (score: 4)
NaNs have some surprising properties, for example bool(np.nan == np.nan) evaluates to False, which is probably the problem you are running into.
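To make that property concrete, here is a minimal sketch. The Series names `left` and `right` are made up for illustration and are not part of the question's data.

```python
import numpy as np
import pandas as pd

# NaN never compares equal to itself.
nan_equals_itself = bool(np.nan == np.nan)

# Element-wise == on Series follows the same rule,
# so two aligned NaNs compare as False.
left = pd.Series(["Miami", np.nan])
right = pd.Series(["Miami", np.nan])
elementwise = (left == right).tolist()

# isna() is the reliable way to detect missing values on both sides.
both_missing = (left.isna() & right.isna()).tolist()

print(nan_equals_itself, elementwise, both_missing)
```

The second row is missing on both sides, yet `==` reports it as unequal, which is why the missing values need some special handling before comparing.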
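If you want them to compare as equal, you can either convert them to strings or use fillna to fill them with the same value everywhere. Since the other answer covers the fillna route, I'll convert to strings here: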
df["Result"] = ((df.City.astype(str) == df.City_1.astype(str)) & (df.Color.astype(str).str.lower() == df.Color_1.astype(str).str.lower())).map({True:"Same", False:"Different"})
Result:
Id City Color City_1 Color_1 Result
0 123 Miami Nan Miami NaN Same
1 124 Miami NaN NaN Miami Different
2 125 Seattle NaN Mexico NaN Different
3 126 NaN white NaN Yellow Different
4 127 Wash NaN Wash NaN Same
5 128 LA pink LA Pink Same
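None of the operations on the existing columns happen in place, so those columns are not modified; only Result is created. Note that I compared lowercased values ("Pink".lower() == "pink") to reproduce your expected result.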
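The question also asks for the counts of Same and Different afterwards. One possible way to get them, sketched here on a standalone Series that mirrors the Result column shown above (the variable names are illustrative only), is `Series.value_counts`:

```python
import pandas as pd

# Stand-in for df["Result"], copied from the output shown above.
result = pd.Series(
    ["Same", "Different", "Different", "Different", "Same", "Same"],
    name="Result",
)

# value_counts tallies each distinct label.
counts = result.value_counts()

# .get with a default avoids a KeyError if one label never occurs.
n_same = int(counts.get("Same", 0))
n_different = int(counts.get("Different", 0))

print(n_same, n_different)
```

On the real frame the same call would be `df["Result"].value_counts()`.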
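Answer 1 (score: 1)
First replace the missing values with the same placeholder, e.g. missing, then compare the lowercased values. If there are only two columns of each kind, Series.str.lower and numpy.where are enough: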
import numpy as np

df1 = df.fillna('missing')
# Parentheses let the boolean expression span two lines.
m = (df1['City'].str.lower().eq(df1['City_1'].str.lower()) &
     df1['Color'].str.lower().eq(df1['Color_1'].str.lower()))
df['Result'] = np.where(m, 'Same', 'Different')
print (df)
Id City Color City_1 Color_1 Result
0 123 Miami NaN Miami NaN Same
1 124 Miami NaN NaN Miami Different
2 125 Seattle NaN Mexico NaN Different
3 126 NaN white NaN Yellow Different
4 127 Wash NaN Wash NaN Same
5 128 LA pink LA Pink Same
Or, if there are more columns, such as City, City_1, City_2, ..., City_N, use a generic solution:
f = lambda x: x.str.lower()
df11 = df.filter(like='City').apply(f).fillna('missing')
df22 = df.filter(like='Color').apply(f).fillna('missing')
m1 = df11.eq(df11.iloc[:, 0], axis=0).all(axis=1)
m2 = df22.eq(df22.iloc[:, 0], axis=0).all(axis=1)
df['Result'] = np.where(m1 & m2, 'Same','Different')
print (df)
Id City Color City_1 Color_1 Result
0 123 Miami NaN Miami NaN Same
1 124 Miami NaN NaN Miami Different
2 125 Seattle NaN Mexico NaN Different
3 126 NaN white NaN Yellow Different
4 127 Wash NaN Wash NaN Same
5 128 LA pink LA Pink Same
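The generic approach can also be wrapped in a small function. This is only a sketch: `add_result`, its `prefixes` and `placeholder` parameters are hypothetical names, and the frame below assumes row 0's Color is a real missing value, as in the printed output above.

```python
import numpy as np
import pandas as pd


def add_result(df, prefixes, placeholder="missing"):
    """Return a copy of df with a Result column (hypothetical helper)."""
    mask = pd.Series(True, index=df.index)
    for prefix in prefixes:
        # Columns whose name contains the prefix, e.g. City, City_1, ...
        group = df.filter(like=prefix).apply(lambda s: s.str.lower())
        group = group.fillna(placeholder)
        # A row matches when every column in the group equals the first one.
        mask &= group.eq(group.iloc[:, 0], axis=0).all(axis=1)
    out = df.copy()
    out["Result"] = np.where(mask, "Same", "Different")
    return out


nan = np.nan
df = pd.DataFrame({
    "Id": [123, 124, 125, 126, 127, 128],
    "City": ["Miami", "Miami", "Seattle", nan, "Wash", "LA"],
    "Color": [nan, nan, nan, "white", nan, "pink"],
    "City_1": ["Miami", nan, "Mexico", nan, "Wash", "LA"],
    "Color_1": [nan, "Miami", nan, "Yellow", nan, "Pink"],
})

out = add_result(df, ["City", "Color"])
print(out)
```

Returning a copy leaves the input frame untouched. Note that a literal string such as 'Nan' is not a missing value, so fillna would not replace it.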