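Fill missing values based on frequency in Python

Time: 2020-08-23 14:22:52

Tags: python pandas numpy

I have seen many cases where missing values are filled with the mean or the median. I would like to know how to fill missing values according to frequency instead.

Here is my setup: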

import numpy as np
import pandas as pd


df = pd.DataFrame({'sex': [1,1,1,1,0,0,np.nan,np.nan,np.nan]})
df['sex_fillna'] = df['sex'].fillna(df.sex.mode()[0])
print(df)
   sex  sex_fillna
0  1.0         1.0  We have 4 males
1  1.0         1.0
2  1.0         1.0
3  1.0         1.0
4  0.0         0.0  we have 2 females, so ratio is 2
5  0.0         0.0
6  NaN         1.0  Here, I want random choice of [1,1,0]  
7  NaN         1.0  eg. 1,1,0 or 1,0,1 or 0,1,1 randomly
8  NaN         1.0

Is there a general way to do this?

My attempt

df['sex_fillna2'] = df['sex'].fillna(np.random.randint(0,2)) # randint returns one scalar, so every NaN gets the same value; the ratio is not guaranteed to be approximately 4/2 = 2

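Note: this example only covers binary values. I am looking for a solution that works for categorical values with more than two categories.

For example: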

class: A   B   C
       20% 40% 60%

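Then, instead of filling every NaN with class C, I want to fill them according to the frequency counts.

But is this a good idea?

According to some comments, filling missing values with different values in different rows may or may not be a good approach. If you would like to weigh in or see whether this is a sound method, I have posted a question on CrossValidated: https://stats.stackexchange.com/questions/484467/is-it-better-to-fillnans-based-on-frequency-rather-than-all-values-with-mean-or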

3 Answers:

Answer 0: (score: 6)

Check with value_counts + np.random.choice

s = df.sex.value_counts(normalize=True)
df['sex_fillna'] = df['sex']
df.loc[df.sex.isna(), 'sex_fillna'] = np.random.choice(s.index, p=s.values, size=df.sex.isna().sum())
df
Out[119]: 
   sex  sex_fillna
0  1.0         1.0
1  1.0         1.0
2  1.0         1.0
3  1.0         1.0
4  0.0         0.0
5  0.0         0.0
6  NaN         0.0
7  NaN         1.0
8  NaN         1.0

The index of s holds the categories, and its values are the probabilities

s
Out[120]: 
1.0    0.666667
0.0    0.333333
Name: sex, dtype: float64
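
The same pattern carries over to columns with more than two categories. Below is a minimal sketch, assuming a hypothetical helper name `fill_by_frequency` and a seeded `np.random.default_rng` generator for reproducibility (neither appears in the answer above). The `grades` data is made up for illustration.

```python
import numpy as np
import pandas as pd


def fill_by_frequency(series, seed=None):
    """Fill NaNs by sampling from the observed category frequencies.

    `fill_by_frequency` and `seed` are illustrative names, not part of pandas.
    """
    rng = np.random.default_rng(seed)
    # value_counts drops NaN by default; normalize=True turns counts into probabilities
    freq = series.value_counts(normalize=True)
    missing = series.isna()
    filled = series.copy()
    # Draw one category per missing row, weighted by its observed frequency
    filled[missing] = rng.choice(freq.index.to_numpy(), size=missing.sum(), p=freq.to_numpy())
    return filled


# Made-up example column with three categories and four missing entries
grades = pd.Series(["A", "B", "B", "C", "C", "C", np.nan, np.nan, np.nan, np.nan])
filled = fill_by_frequency(grades, seed=0)
print(filled.value_counts())
```

Because each missing row is drawn independently, the filled values only match the observed ratio on average; with few missing rows the realized ratio can differ noticeably.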

Answer 1: (score: 4)

If your column has more than two valid values, the general approach is to find the distribution and fill according to it. For example,

dist = df.sex.value_counts(normalize=True)
print(dist)
1.0    0.666667
0.0    0.333333
Name: sex, dtype: float64

Then get the rows with missing values

nan_rows = df['sex'].isnull()

Finally, fill those rows with values chosen at random according to the distribution above
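
The final fill step can be sketched as below. This is an assumed reconstruction that follows the `value_counts` + `np.random.choice` pattern already shown in this thread and reuses the `df` from the question; filling the `sex` column in place is one possible choice, not necessarily what the answer intended.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'sex': [1, 1, 1, 1, 0, 0, np.nan, np.nan, np.nan]})

# Observed distribution of the non-missing values
dist = df['sex'].value_counts(normalize=True)
# Boolean mask of the rows that need a value
nan_rows = df['sex'].isnull()

# Draw one value per missing row, weighted by the observed distribution
df.loc[nan_rows, 'sex'] = np.random.choice(dist.index, size=nan_rows.sum(), p=dist.values)
print(df)
```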

Answer 2: (score: 4)

Use

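import numpy as np

categories = ["A", "B", "C"]
weights = [0.2, 0.4, 0.6]

def choose_k(k, categories, weights):
    # np.random.choice needs probabilities that sum to 1, so normalize the weights first
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    # Pass size and p by keyword: the second positional argument is size, not p
    return list(np.random.choice(categories, size=k, p=p))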

Or (slower, but with no additional dependencies):

from random import choices

def choose_k(k, categories, weights):
    # choices treats weights as relative and returns a flat list of k draws
    return choices(categories, weights=weights, k=k)
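
To sanity-check either version, the sampled proportions can be compared with the normalized weights. Here is a small sketch using only the standard library; the sample size of 60,000 and the seed are arbitrary values chosen for illustration.

```python
import random
from collections import Counter

categories = ["A", "B", "C"]
weights = [0.2, 0.4, 0.6]  # relative weights; random.choices does not require them to sum to 1

random.seed(0)  # fixed seed so the run is repeatable
sample = random.choices(categories, weights=weights, k=60_000)

counts = Counter(sample)
total = sum(weights)
for cat, w in zip(categories, weights):
    expected = w / total               # normalized weight
    observed = counts[cat] / len(sample)  # empirical share in the sample
    print(f"{cat}: expected {expected:.3f}, observed {observed:.3f}")
```

With weights 0.2, 0.4 and 0.6 the normalized shares are 1/6, 1/3 and 1/2, so the observed shares should land close to those values for a sample this large.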