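I have a pandas DataFrame whose columns hold address fields. My problem is that in two of the columns, some rows contain the same value in both cells. Does anyone know how to conditionally change the value in one column when the two columns hold duplicate values? Ideally, I would like to keep one value and set the other to np.nan.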
Here is a test case:
from io import StringIO
import pandas as pd
# Wrap the JSON literal in StringIO: pandas 2.1+ deprecates passing a raw JSON string to read_json
test = pd.read_json(StringIO('{"housename":{"16":null,"17":null,"18":null},"name":{"16":"Shoecare","17":"33","18":"33A"},"house_number":{"16":"32","17":"33","18":"33A"},"street":{"16":"Carfax","17":"Carfax","18":"Carfax"},"city":{"16":"Horsham","17":"Horsham","18":"Horsham"},"postcode":{"16":"RH12 1EE","17":"RH12 1EE","18":"RH12 1EE"}}'))
city house_number housename name postcode street
16 Horsham 32 NaN Shoecare RH12 1EE Carfax
17 Horsham 33 NaN 33 RH12 1EE Carfax
18 Horsham 33A NaN 33A RH12 1EE Carfax
On the test case, I tried test.duplicated(subset=['house_number', 'name']), but it does not identify the values that are duplicated across the house_number and name columns.
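Does anyone have a suggestion for how to first identify the cells that are duplicated across the two columns, and then set one of the values to np.nan?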
Desired output:
housename name house_number street city postcode
16 NaN Shoecare 32 Carfax Horsham RH12 1EE
17 NaN NaN 33 Carfax Horsham RH12 1EE
18 NaN NaN 33A Carfax Horsham RH12 1EE
Answer 0 (score: 2)
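If the two columns are house_number and name, you can do it like this:
import numpy as np
# Where both columns hold the same value, blank out name; otherwise keep it
test['name'] = np.where(test['house_number'] == test['name'], np.nan, test['name'])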
Output:
city house_number housename name postcode street
16 Horsham 32 NaN Shoecare RH12 1EE Carfax
17 Horsham 33 NaN NaN RH12 1EE Carfax
18 Horsham 33A NaN NaN RH12 1EE Carfax
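For what it's worth, the reason duplicated() did not help is that it flags rows that repeat an earlier row (over the chosen subset of columns). It does not compare cells within the same row. A rough sketch of the difference is below, together with a pandas-only variant of the fix that uses Series.mask instead of np.where. The frame is rebuilt by hand here so the snippet runs on its own, and the variable names row_dupes and same are just illustrative, not from the original post.

```python
import pandas as pd

# Hand-built copy of the question's test frame (assumption: same values, same index)
test = pd.DataFrame(
    {
        "housename": [None, None, None],
        "name": ["Shoecare", "33", "33A"],
        "house_number": ["32", "33", "33A"],
        "street": ["Carfax", "Carfax", "Carfax"],
        "city": ["Horsham", "Horsham", "Horsham"],
        "postcode": ["RH12 1EE", "RH12 1EE", "RH12 1EE"],
    },
    index=[16, 17, 18],
)

# duplicated() compares each row's (house_number, name) pair against earlier rows.
# All three pairs are distinct, so nothing is flagged.
row_dupes = test.duplicated(subset=["house_number", "name"])

# An element-wise comparison checks the two cells within each row instead.
same = test["name"] == test["house_number"]

# mask() replaces the entries where the condition is True with NaN by default.
test["name"] = test["name"].mask(same)

print(test[["name", "house_number"]])
```

mask keeps everything inside pandas and preserves the index, but for this case it should give the same result as the np.where line above, so which one to use is mostly a matter of taste.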