I have a DataFrame with many rows and a lot of NaN values. An example:
index Company Area
0 Google Technology
1 Coca Cola Drinks
2 NaN Drinks
3 Apple Technology
4 NaN Technology
5 Gatorade Drinks
6 Dell Technology
7 Apple Technology
8 Coca Cola Drinks
9 NaN Drinks
10 Google Technology
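My idea is to fill each NaN in Company with one of the 2 most common values for that row's Area.
For example: if the most common companies in the Technology area are Apple and Google, I want to fill the NaN values where df['Area'] == 'Technology' with one of those two values, chosen at random.
I have already created a group-by DataFrame with the most common values, which looks like this: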
Area Company
Technology Google
Technology Apple
Drinks Coca Cola
Drinks Pepsi
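For readers who want to reproduce a table like this, here is a minimal sketch of one way it could be derived from the data itself. It assumes the sample frame shown above; the names top_companies, top2 and df1 are hypothetical. The sample contains no Pepsi rows, so on this data the second Drinks candidate comes out as Gatorade rather than Pepsi.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Company": ["Google", "Coca Cola", np.nan, "Apple", np.nan, "Gatorade",
                "Dell", "Apple", "Coca Cola", np.nan, "Google"],
    "Area": ["Technology", "Drinks", "Drinks", "Technology", "Technology",
             "Drinks", "Technology", "Technology", "Drinks", "Drinks",
             "Technology"],
})

def top_companies(companies, n=2):
    # value_counts() skips NaN and sorts by descending frequency
    return companies.value_counts().head(n).index.tolist()

# Series indexed by Area, each value a list of up to n company names
top2 = df.groupby("Area")["Company"].apply(top_companies)

# Long form, one row per (Area, Company) pair, like the table above
df1 = top2.explode().reset_index()
print(df1)
```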
The result should look like this:
index Company Area
0 Google Technology
1 Coca Cola Drinks
2 Pepsi Drinks
3 Apple Technology
4 Google Technology
5 Gatorade Drinks
6 Dell Technology
7 Apple Technology
8 Coca Cola Drinks
9 Pepsi Drinks
10 Google Technology
I hope you can help me.
Thanks!
Answer 0 (score: 0)
I would use random.choice:
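import random
# df1 is the group-by DataFrame from the question (one row per Area/Company candidate).
# Collect each area's candidates into a list, align them to df's rows by Area,
# then draw one candidate per row.
s = df1.groupby('Area').Company.apply(list).reindex(df.Area).apply(lambda x: random.choice(x))
s.index = df.index
# fillna only replaces the NaN entries; existing companies are left untouched
df.Company = df.Company.fillna(s)
df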
Out[200]:
index Company Area
0 0 Google Technology
1 1 CocaCola Drinks
2 2 CocaCola Drinks
3 3 Apple Technology
4 4 Google Technology
5 5 Gatorade Drinks
6 6 Dell Technology
7 7 Apple Technology
8 8 CocaCola Drinks
9 9 Pepsi Drinks
10 10 Google Technology
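A variation on the same idea, sketched here under a few assumptions, draws a random candidate only for the rows that are actually missing instead of for every row. The frames are rebuilt from the question's tables, and the seed and the names rng, candidates and missing are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Data rebuilt from the question
df = pd.DataFrame({
    "Company": ["Google", "Coca Cola", np.nan, "Apple", np.nan, "Gatorade",
                "Dell", "Apple", "Coca Cola", np.nan, "Google"],
    "Area": ["Technology", "Drinks", "Drinks", "Technology", "Technology",
             "Drinks", "Technology", "Technology", "Drinks", "Drinks",
             "Technology"],
})
# The asker's table of most common companies per area
df1 = pd.DataFrame({
    "Area": ["Technology", "Technology", "Drinks", "Drinks"],
    "Company": ["Google", "Apple", "Coca Cola", "Pepsi"],
})

rng = np.random.default_rng(0)  # seed is an assumption, only for repeatable runs

# Area -> list of candidate companies
candidates = df1.groupby("Area")["Company"].apply(list)

# Draw one candidate for each missing row, based on that row's Area
missing = df["Company"].isna()
df.loc[missing, "Company"] = [
    str(rng.choice(candidates[area])) for area in df.loc[missing, "Area"]
]
print(df)
```

Because the draw is random, which of the two candidates lands in each row depends on the seed.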
Answer 1 (score: 0)
import io
import pandas as pd
z=io.StringIO("""
Company Area
Google Technology
CocaCola Drinks
NaN Drinks
Apple Technology
NaN Technology
Gatorade Drinks
Dell Technology
Apple Technology
CocaCola Drinks
NaN Drinks
Google Technology""")
df = pd.read_table(z, sep=r"\s+")
Then you can do:
t = df.groupby("Area").Company.value_counts()
s = t.groupby("Area").apply(lambda x: [(i[1]) for i,v in zip(x.index,x) if v==max(x)])
where s is a Series holding the most common values for each area. For example:
>>> s
Area
Drinks [CocaCola]
Technology [Apple, Google]
Name: Company, dtype: object
Now use random.choice:
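from random import choice
# Index by Area so each missing row's label can be looked up in s
df2 = df.set_index("Area")
mask = df2.Company.isna()
# For every missing row, pick one of that area's most common companies at random
df2.loc[mask, "Company"] = [choice(s[i]) for i in df2.loc[mask].index]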
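As a compact alternative, the two steps above could probably be sketched with Series.mode, which returns every value tied for most frequent and skips NaN by default. This version also keeps the original row index rather than moving Area into it. The sample data is rebuilt inline, and the seed and the name modes are assumptions made for illustration.

```python
import random

import numpy as np
import pandas as pd

# Same sample data as in this answer
df = pd.DataFrame({
    "Company": ["Google", "CocaCola", np.nan, "Apple", np.nan, "Gatorade",
                "Dell", "Apple", "CocaCola", np.nan, "Google"],
    "Area": ["Technology", "Drinks", "Drinks", "Technology", "Technology",
             "Drinks", "Technology", "Technology", "Drinks", "Drinks",
             "Technology"],
})

random.seed(0)  # seed is an assumption, only for repeatable runs

# Area -> list of all companies tied for most frequent in that area
modes = df.groupby("Area")["Company"].apply(lambda x: x.mode().tolist())

# Fill each missing Company with a random pick from its area's list
mask = df["Company"].isna()
df.loc[mask, "Company"] = [
    random.choice(modes[area]) for area in df.loc[mask, "Area"]
]
print(df)
```

With this sample, Drinks has a single most common company, so both missing Drinks rows receive CocaCola, while the missing Technology row receives either Apple or Google.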