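I have seen many cases where missing values are filled with the mean or the median. I would like to know how to fill missing values according to frequency instead.
Here is my setup: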
import numpy as np
import pandas as pd
df = pd.DataFrame({'sex': [1,1,1,1,0,0,np.nan,np.nan,np.nan]})
df['sex_fillna'] = df['sex'].fillna(df.sex.mode()[0])
print(df)
   sex  sex_fillna
0  1.0         1.0   We have 4 males
1  1.0         1.0
2  1.0         1.0
3  1.0         1.0
4  0.0         0.0   we have 2 females, so ratio is 2
5  0.0         0.0
6  NaN         1.0   Here, I want random choice of [1,1,0]
7  NaN         1.0   eg. 1,1,0 or 1,0,1 or 0,1,1 randomly
8  NaN         1.0
Is there a general way to do this?
My attempt:
df['sex_fillan2'] = df['sex'].fillna(np.random.randint(0,2)) # here the ratio is not guaranteed to approx 4/2 = 2
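Note: this example only covers binary values. I am looking for an approach that also works for categorical values with more than two categories.
For example: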
class: A B C
20% 40% 60%
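Then, instead of filling every NaN with class C, I want to fill them according to the frequency counts.
Based on some comments, filling missing values with different values in different rows may or may not be a good idea. If you would like to weigh in on whether this is a sound approach, I have opened a question on CrossValidated: https://stats.stackexchange.com/questions/484467/is-it-better-to-fillnans-based-on-frequency-rather-than-all-values-with-mean-or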
Answer 0: (score: 6)
Check with value_counts + np.random.choice:
s = df.sex.value_counts(normalize=True)
df['sex_fillna'] = df['sex']
df.loc[df.sex.isna(), 'sex_fillna'] = np.random.choice(s.index, p=s.values, size=df.sex.isna().sum())
df
Out[119]:
   sex  sex_fillna
0  1.0         1.0
1  1.0         1.0
2  1.0         1.0
3  1.0         1.0
4  0.0         0.0
5  0.0         0.0
6  NaN         0.0
7  NaN         1.0
8  NaN         1.0
The index of s holds the categories and its values hold the probabilities:
s
Out[120]:
1.0 0.666667
0.0 0.333333
Name: sex, dtype: float64
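The same idea carries over to columns with more than two categories. Below is a minimal sketch, not part of the answer above: the helper name `fill_by_frequency`, its `seed` parameter, and the `grade` example column are all assumptions made for illustration. It uses NumPy's `default_rng` so the draws can be repeated.

```python
import numpy as np
import pandas as pd


def fill_by_frequency(series, seed=None):
    """Fill NaNs with values drawn from the observed category frequencies.

    Hypothetical helper for illustration; `seed` only makes the draws repeatable.
    """
    rng = np.random.default_rng(seed)
    # value_counts ignores NaN by default; normalize=True yields proportions
    freq = series.value_counts(normalize=True)
    missing = series.isna()
    filled = series.copy()
    # Draw one value per missing row, weighted by the observed proportions
    filled.loc[missing] = rng.choice(
        freq.index.to_numpy(), size=int(missing.sum()), p=freq.to_numpy()
    )
    return filled


# Made-up example data with three categories and three missing values
df = pd.DataFrame({'grade': ['A', 'B', 'B', 'C', 'C', 'C', np.nan, np.nan, np.nan]})
df['grade_filled'] = fill_by_frequency(df['grade'], seed=0)
print(df)
```

Because the fill values are sampled, the proportions among the filled rows only approximate the observed frequencies, and the approximation is rough when few rows are missing.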
Answer 1: (score: 4)
If your column has more than two valid values, the general approach is to find the distribution of the observed values and fill according to that distribution. For example, first compute the distribution:
dist = df.sex.value_counts(normalize=True)
print(dist)
1.0    0.666667
0.0    0.333333
Name: sex, dtype: float64
Then get the rows with missing values:
nan_rows = df['sex'].isnull()
Finally, fill those rows with values chosen at random according to the distribution above:
# The fill line is missing from this copy of the answer; this is a reconstruction
# using the same np.random.choice call as the previous answer.
df.loc[nan_rows, 'sex'] = np.random.choice(dist.index, size=nan_rows.sum(), p=dist.values)
Answer 2: (score: 4)
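Use
import numpy as np
categories = ["A", "B", "C"]
weights = [0.2, 0.4, 0.6]
def choose_k(k, categories, weights):
    # np.random.choice needs probabilities that sum to 1, so normalize the weights
    p = np.asarray(weights, dtype=float)
    # pass the probabilities as p=; the second positional argument is size
    return list(np.random.choice(categories, size=k, p=p / p.sum()))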
Or (slower, but with no additional dependencies):
from random import choices
def choose_k(k, categories, weights):
    # choices accepts relative weights and returns a list of k draws
    return choices(categories, weights=weights, k=k)
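The weights in this answer sum to 1.2 rather than 1. `random.choices` treats them as relative weights, so they correspond to probabilities of 1/6, 1/3 and 1/2, whereas `np.random.choice` requires its `p` argument to sum to 1. The sketch below is a quick sanity check of the standard-library version; the seed and the sample size are arbitrary choices made for illustration.

```python
from collections import Counter
from random import Random

categories = ["A", "B", "C"]
weights = [0.2, 0.4, 0.6]  # relative weights; they need not sum to 1

rng = Random(0)  # seeded only so the sketch is repeatable
draws = rng.choices(categories, weights=weights, k=60_000)

counts = Counter(draws)
total = sum(weights)
for cat, w in zip(categories, weights):
    expected = w / total
    observed = counts[cat] / len(draws)
    print(f"{cat}: expected {expected:.3f}, observed {observed:.3f}")
```

With a sample this large, the observed shares should land close to the normalized weights. With only a handful of missing values the match is much looser.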