Question

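I have a df with a lot of missing data spread across several columns that are really the same column (they come from merging datasets). For example, consider the following: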

import numpy as np
import pandas as pd

temp = pd.DataFrame({"fruit_1": ["apple", "pear", "don't want to tell", np.nan, np.nan, np.nan],
                     "fruit_2": [np.nan, np.nan, "don't want to tell", "apple", "don't want to tell", np.nan],
                     "fruit_3": ["apple", np.nan, "pear", "don't want to tell", np.nan, "pear"]})

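I now want to merge them into a single column; conflicts should be resolved as follows:

np.nan is always overwritten by any other information
"don't want to tell" only overwrites np.nan
Any other value only overwrites np.nan and "don't want to tell" (i.e. the first value is kept).

I tried creating a new column and using apply (see below).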

temp.insert(0, "fruit", np.nan)
temp['fruit'].apply(lambda row: row["fruit"] if np.isnan(row["fruit"]) and not np.isnan(row["fruit_1"]) else np.nan) # map col

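However, the code raises TypeError: 'float' object is not subscriptable

Can someone tell me (1) whether this is feasible in general, and if so, what my mistake is, and (2) what the most efficient way to do it would be?

Many thanks.

**Edit** The expected output is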

                fruit             
0               apple         
1                pear       
2                pear  
3               apple             
4  don't want to tell
5                pear

Answer 1

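Using mask with bfill, plus an additional np.where

# Hide "don't want to tell", then take the first remaining non-null value in each row
s=temp.mask(temp=="don't want to tell").bfill(axis=1).iloc[:,0]
# Where no real value was found but the row contains "don't want to tell", fall back to it
s=np.where((temp=="don't want to tell").any(axis=1)&s.isnull(),"don't want to tell",s)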
s
Out[17]: 
array(['apple', 'pear', 'pear', 'apple', "don't want to tell", 'pear'],
      dtype=object)
temp['New']=s
temp
Out[19]: 
              fruit_1  ...                 New
0               apple  ...               apple
1                pear  ...                pear
2  don't want to tell  ...                pear
3                 NaN  ...               apple
4                 NaN  ...  don't want to tell
5                 NaN  ...                pear
[6 rows x 4 columns]
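
On part (1) of the question: Series.apply passes each element of the column to the lambda, not a row, so row["fruit"] ends up indexing a float NaN, which is where the TypeError comes from. Row-wise access needs DataFrame.apply with axis=1, and pd.isna is safer than np.isnan because np.isnan fails on strings. Below is a minimal sketch that wraps the approach above in a function and compares it with a row-wise version. The names SENTINEL, coalesce_first and pick_row are made up for illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd

SENTINEL = "don't want to tell"  # hypothetical constant name for the placeholder value


def coalesce_first(df, sentinel=SENTINEL):
    """First real value per row, else the sentinel if it occurs, else NaN."""
    is_sentinel = df.eq(sentinel)
    # Hide the sentinel, then pull the first remaining non-null value to column 0.
    first_real = df.mask(is_sentinel).bfill(axis=1).iloc[:, 0]
    # Use the sentinel only where no real value exists but the sentinel appeared.
    return first_real.mask(first_real.isna() & is_sentinel.any(axis=1), sentinel)


def pick_row(row, sentinel=SENTINEL):
    """Row-wise equivalent, meant for DataFrame.apply(..., axis=1)."""
    values = row.dropna()
    real = values[values != sentinel]
    if not real.empty:
        return real.iloc[0]
    # Anything left after dropping NaN and real values must be the sentinel.
    return sentinel if not values.empty else np.nan


temp = pd.DataFrame({
    "fruit_1": ["apple", "pear", SENTINEL, np.nan, np.nan, np.nan],
    "fruit_2": [np.nan, np.nan, SENTINEL, "apple", SENTINEL, np.nan],
    "fruit_3": ["apple", np.nan, "pear", SENTINEL, np.nan, "pear"],
})
cols = ["fruit_1", "fruit_2", "fruit_3"]

temp["fruit"] = coalesce_first(temp[cols])     # vectorized
row_wise = temp[cols].apply(pick_row, axis=1)  # same rule, one row at a time
```

The vectorized version is likely the faster of the two on large frames, since the row-wise one calls a Python function once per row; the row-wise one is mainly here to show what the original apply attempt was aiming at.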
