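I'm new to R and want to handle missing values. My dataset has only a few columns, and the `Purchase` column contains missing values.
I want to impute each missing `Purchase` value with the mean `Purchase` of its `Master_Category` group.
(Python code)
# Flag the rows with missing Purchase values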
miss_Purch_rows = dataset['Purchase'].isnull()
# Check Purchase values from the grouping by the newly created Master_Product_Category column
categ_mean = dataset.groupby(['Master_Product_Category'])['Purchase'].mean()
# Impute mean Purchase value based on Master_Product_Category column
dataset.loc[miss_Purch_rows,'Purchase'] = dataset.loc[miss_Purch_rows,'Master_Product_Category'].apply(lambda x: categ_mean.loc[x])
I'm looking for equivalent code in R that imputes missing values with the mean, grouped by another column.
Sample data from the dataset:
User_ID Product_ID Gender Age Occupation Marital_Status Master_Category Purchase
1 1000001 P00000142 F 0-17 10 0 345 13650
2 1000001 P00004842 F 0-17 10 0 3412 13645
3 1000001 P00025442 F 0-17 10 0 129 15416
4 1000001 P00051442 F 0-17 10 0 8170 9938
5 1000001 P00051842 F 0-17 10 0 480 2849
6 1000001 P00057542 F 0-17 10 0 345 NA
7 1000001 P00058142 F 0-17 10 0 3412 11051
8 1000001 P00058242 F 0-17 10 0 3412 NA
9 1000001 P00059442 F 0-17 10 0 6816 16622
10 1000001 P00064042 F 0-17 10 0 3412 8190
I tried:
with(dataset, sapply(X = Purchase, INDEX = Master_Category, FUN = mean, na.rm = TRUE))
But this doesn't seem to work.
Answer 0 (score: 1)
Grouped operations like this are usually easy with the tidyverse set of packages.
First, we read your sample data:
txt <- 'User_ID Product_ID Gender Age Occupation Marital_Status Master_Category Purchase
1000001 P00000142 F 0-17 10 0 345 13650
1000001 P00004842 F 0-17 10 0 3412 13645
1000001 P00025442 F 0-17 10 0 129 15416
1000001 P00051442 F 0-17 10 0 8170 9938
1000001 P00051842 F 0-17 10 0 480 2849
1000001 P00057542 F 0-17 10 0 345 NA
1000001 P00058142 F 0-17 10 0 3412 11051
1000001 P00058242 F 0-17 10 0 3412 NA
1000001 P00059442 F 0-17 10 0 6816 16622
1000001 P00064042 F 0-17 10 0 3412 8190'
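# Note: read.table parses the all-"F" Gender column as logical (FALSE);
# pass colClasses = c(Gender = "character") to keep it as text.
df <- read.table(text = txt, header = TRUE)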
Then we group by `Master_Category` and use `ifelse` inside `mutate` to fill any `NA` values with the group mean:
library(tidyverse)
df.new <- df %>%
  group_by(Master_Category) %>%
  mutate(Purchase = ifelse(is.na(Purchase), mean(Purchase, na.rm = TRUE), Purchase))
User_ID Product_ID Gender Age Occupation Marital_Status Master_Category Purchase
<int> <fct> <lgl> <fct> <int> <int> <int> <dbl>
1 1000001 P00000142 FALSE 0-17 10 0 345 13650
2 1000001 P00004842 FALSE 0-17 10 0 3412 13645
3 1000001 P00025442 FALSE 0-17 10 0 129 15416
4 1000001 P00051442 FALSE 0-17 10 0 8170 9938
5 1000001 P00051842 FALSE 0-17 10 0 480 2849
6 1000001 P00057542 FALSE 0-17 10 0 345 13650
7 1000001 P00058142 FALSE 0-17 10 0 3412 11051
8 1000001 P00058242 FALSE 0-17 10 0 3412 10962
9 1000001 P00059442 FALSE 0-17 10 0 6816 16622
10 1000001 P00064042 FALSE 0-17 10 0 3412 8190
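If you would rather not load the tidyverse, the same grouped imputation can be sketched in base R. The call in the question does not group anything because `INDEX` is an argument of `tapply`, not `sapply`. The sketch below rebuilds only the two relevant columns of the sample data (an assumption made to keep it self-contained) and uses `ave()`, which applies a function within each group and returns a vector aligned with the original rows:

```r
# Only the two columns needed for the imputation, copied from the sample data
df <- data.frame(
  Master_Category = c(345, 3412, 129, 8170, 480, 345, 3412, 3412, 6816, 3412),
  Purchase = c(13650, 13645, 15416, 9938, 2849, NA, 11051, NA, 16622, 8190)
)

# Per-category means, ignoring NA (tapply is the function that takes INDEX)
categ_mean <- tapply(df$Purchase, df$Master_Category, mean, na.rm = TRUE)

# ave() runs FUN on each Master_Category group and keeps the row order,
# so NA entries are replaced by their group's mean and other values are kept
df$Purchase <- ave(
  df$Purchase, df$Master_Category,
  FUN = function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x)
)
```

On this sample, rows 6 and 8 become 13650 and 10962, matching the tidyverse result above. A category whose `Purchase` values are all missing would get `NaN`, so such groups need separate handling.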