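I want to replace the NaN values in "second" with "yes" or "no", depending on which of the two has the higher count within each group of the "first" column; if the counts are equal, use "yes". For example, this is my original DataFrame: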
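import numpy as np
import pandas as pd

test = pd.DataFrame({'first':['a','a','b','c','b','c','a','c','b','a','b','c','c','d','d','d'],
'second':['yes','yes','no','no',np.nan,np.nan,'no','yes',np.nan,np.nan,'yes','no','no',np.nan,np.nan,np.nan]})
# DataFrame.sort was removed from pandas; sort_values is the current equivalent
test = test.sort_values('first')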
test
first second
1 a yes
6 a no
9 a NaN
0 a yes
4 b NaN
10 b yes
2 b no
8 b NaN
5 c NaN
3 c no
11 c no
12 c no
7 c yes
14 d NaN
15 d NaN
13 d NaN
I would like my new DataFrame to look like this:
first second
1 a yes
6 a no
9 a yes
0 a yes
4 b yes
10 b yes
2 b no
8 b yes
5 c no
3 c no
11 c no
12 c no
7 c yes
14 d NaN
15 d NaN
13 d NaN
Answer 0 (score: 1)
Here is one approach. Start with the test frame:
test = pd.DataFrame({'first':['a','a','b','c','b','c','a','c','b','a','b','c'],
'second':['yes','yes','no','no',np.nan,np.nan,'no','yes',np.nan,np.nan,'yes','no']})
test = test.sort_values('first')
test
first second
0 a yes
1 a yes
6 a no
9 a NaN
4 b NaN
10 b yes
8 b NaN
2 b no
3 c no
5 c NaN
11 c no
7 c yes
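Option 1
Group and count, then sort to create a new DataFrame (testCounts). Note: "second" is sorted in descending order so that "yes" comes first within a group when the counts are equal.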
s = test.groupby(['first',"second"])["first"].agg("count")
s.name = "count"
testCounts = s.reset_index().sort_values(["first","count","second"],ascending=[True,False,False])
testCounts
first second count
1 a yes 2
0 a no 1
3 b yes 1
2 b no 1
4 c no 2
5 c yes 1
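Then we use boolean indexing to select the NaN rows, and map a lambda function that filters testCounts by each "first" value and takes the "second" value of the first matching row.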
rowIndex = test["second"].isnull()
test.loc[rowIndex,"second"] = test["first"].map(lambda s :
testCounts[testCounts["first"] == s]["second"].iloc[0])
test
first second
0 a yes
1 a yes
6 a no
9 a yes
4 b yes
10 b yes
8 b yes
2 b no
3 c no
5 c no
11 c no
7 c yes
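Option 2
Starting from the frame above, we group to get the counts as in Option 1. Next, we build a dictionary by sorting, grouping by "first", and taking the first row of each group.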
s = test.groupby(['first',"second"])["first"].agg("count")
s.name = "count"
d = (s.reset_index()
      .sort_values(["first","count","second"],ascending=[True,False,False])
      .groupby("first").first()["second"].to_dict())
d
{'a': 'yes', 'b': 'yes', 'c': 'no'}
Use boolean indexing as before, and map the dict (d) onto the "first" column.
rowIndex = test["second"].isnull()
test.loc[rowIndex,"second"] = test["first"].map(d)
test
first second
0 a yes
1 a yes
6 a no
9 a yes
4 b yes
10 b yes
8 b yes
2 b no
3 c no
5 c no
11 c no
7 c yes
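The steps of Option 2 can be collected into one function and run against the question's full frame, which also contains a group ("d") with no non-null values. This is a minimal sketch, not part of the original answer: the function name fill_with_majority, its parameters, and the helper column wins_tie are hypothetical names chosen for illustration. It relies on Series.map returning NaN for keys that are missing from the lookup, so all-NaN groups are left untouched.

```python
import numpy as np
import pandas as pd

def fill_with_majority(df, key="first", col="second", tie_winner="yes"):
    """Fill NaN in `col` with the most frequent value of its `key` group.

    Hypothetical helper for illustration; ties go to `tie_winner`.
    """
    # Count non-null values per (key, col) pair; groupby skips NaN by default.
    counts = df.groupby([key, col]).size().reset_index(name="count")
    # Flag the tie winner so it sorts ahead of other values with equal counts.
    counts["wins_tie"] = counts[col] == tie_winner
    counts = counts.sort_values([key, "count", "wins_tie"],
                                ascending=[True, False, False])
    # The first row of each group now holds the value to fill with.
    majority = counts.drop_duplicates(key).set_index(key)[col]

    out = df.copy()
    mask = out[col].isnull()
    # Keys missing from `majority` (all-NaN groups) map to NaN and stay missing.
    out.loc[mask, col] = out.loc[mask, key].map(majority)
    return out

test = pd.DataFrame({
    "first": ["a", "a", "b", "c", "b", "c", "a", "c",
              "b", "a", "b", "c", "c", "d", "d", "d"],
    "second": ["yes", "yes", "no", "no", np.nan, np.nan, "no", "yes",
               np.nan, np.nan, "yes", "no", "no", np.nan, np.nan, np.nan],
})
result = fill_with_majority(test)
print(result.sort_values("first"))
```

Mapping only the masked rows avoids computing a lookup for rows that already have a value, and working on a copy leaves the input frame unchanged.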
Answer 1 (score: 1)
# Note: g and first_index are not defined anywhere in this answer
def replace_na(first_value):
    return test[test['first']==first_value]['second'].fillna(g[first_value].index[0])
pd.concat(map(replace_na,first_index))
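The snippet above uses g and first_index without defining them. The sketch below shows one way it could be made runnable. The definitions of g (a dict of per-group counts, ordered so the fill value comes first) and first_index (the unique values of "first") are assumptions, not the answer author's code, and the guard for groups without counts is an addition as well.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    "first": ["a", "a", "b", "c", "b", "c", "a", "c", "b", "a", "b", "c"],
    "second": ["yes", "yes", "no", "no", np.nan, np.nan,
               "no", "yes", np.nan, np.nan, "yes", "no"],
})

# Assumed definition: counts per (first, second), highest count first,
# with "yes" ahead of "no" when the counts are equal.
counts = test.groupby(["first", "second"]).size().reset_index(name="count")
counts = counts.sort_values(["count", "second"], ascending=[False, False])
# One Series per group, indexed by the "second" values in sorted order.
g = {key: grp.set_index("second")["count"]
     for key, grp in counts.groupby("first")}

# Assumed definition: the distinct group labels.
first_index = test["first"].unique()

def replace_na(first_value):
    group = test.loc[test["first"] == first_value, "second"]
    if first_value not in g:
        # Added guard: a group with no non-null values is returned as-is.
        return group
    return group.fillna(g[first_value].index[0])

# The result is a Series of filled "second" values keyed by the original index.
filled = pd.concat(map(replace_na, first_index))
print(filled.sort_index())
```

Because filled keeps the original row labels, assigning it back with test["second"] = filled aligns on the index.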