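In my real data, I have multiple outliers for multiple variables. My data looks like the example below, but the numbers here are completely random. I want to use 95% winsorization to pull in all data points that lie more than 2 SD above or below the mean.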
df <- read.csv(header=TRUE, text="
id, group, test1, test2
1, 0, 57, 82
2, 0, 77, 80
3, 0, 67, 90
4, 0, 15, 70
5, 0, 58, 72
6, 1, 18, 44
7, 1, 44, 44
8, 1, 18, 46
9, 1, 20, 44
10, 1, 14, 38")
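I know about the 'winsorize' function in the 'robustHD' package, but I am not sure how to make the winsorization respect the two different groups and cover multiple variables at the same time.
I tried to solve the problem with the code below, but it is incomplete: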
library(robustHD)
library(dplyr)
new.df.wins = df %>%
  group_by(group) %>%
  mutate(measure_winsorized = winsorize(c(test1,test2)))
It returns this error:
Error: Column `measure_winsorized` must be length 45 (the group size) or one, not 90
I am open to other ideas as well. Thanks!
Answer 0 (score: 0)
Consider creating two new fields, one for each numeric field to be winsorized:
new.df.wins <- df %>%
  group_by(group) %>%
  mutate(measure_winsorized_test1 = winsorize(test1),
         measure_winsorized_test2 = winsorize(test2))
Or with base R's ave:
new.df.wins <- within(df, {
  measure_winsorized_test2 <- ave(test2, group, FUN=winsorize)
  measure_winsorized_test1 <- ave(test1, group, FUN=winsorize)
})
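As far as I know, robustHD's winsorize() defaults to the median and MAD with a cutoff of 2, not the mean and SD mentioned in the question. If a mean ± 2 SD rule is what is wanted, a minimal base R sketch could look like the following. The helper `winsorize_sd` is hypothetical (it is not part of robustHD), and the data frame simply re-creates the question's example so the block runs on its own.

```r
# Hypothetical helper (not from robustHD): clamp values to mean +/- k * SD
winsorize_sd <- function(x, k = 2) {
  centre <- mean(x, na.rm = TRUE)
  spread <- sd(x, na.rm = TRUE)
  pmin(pmax(x, centre - k * spread), centre + k * spread)
}

# Re-creation of the example data from the question
df <- data.frame(
  id    = 1:10,
  group = rep(0:1, each = 5),
  test1 = c(57, 77, 67, 15, 58, 18, 44, 18, 20, 14),
  test2 = c(82, 80, 90, 70, 72, 44, 44, 46, 44, 38)
)

# Apply the helper within each group, one column at a time
df$test1_wins <- ave(df$test1, df$group, FUN = winsorize_sd)
df$test2_wins <- ave(df$test2, df$group, FUN = winsorize_sd)
```

With only five rows per group, no value can lie more than about 1.79 sample SDs from its group mean, so k = 2 leaves this toy data unchanged. To see the clamping take effect, pass a smaller cutoff, for example `ave(df$test1, df$group, FUN = function(x) winsorize_sd(x, k = 1))`.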
To winsorize both at once, assign to the two new columns in a single step:
# TIDYVERSE (dplyr)
new.df.wins <- df %>%
  group_by(group) %>%
  mutate_at(.funs = list(wins = winsorize), .vars = vars(test1:test2))
# TINYVERSE (I.E. BASE R)
df[c("test1_wins", "test2_wins")] <- with(df, ave(cbind(test1, test2),
                                                  group, FUN=winsorize))
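For context, the error in the question arises because c(test1, test2) stacks the two columns into one vector that is twice as long as the group, which mutate() cannot store in a single column. The sketch below shows that length mismatch and then loops over the column names so that each column is treated separately within each group. The helper `winsorize_q` is a hypothetical percentile-based stand-in (5th and 95th percentiles, one possible reading of "95% winsorization") used only so the block runs without extra packages; it is not robustHD's method.

```r
# Re-creation of the example data from the question
df <- data.frame(
  id    = 1:10,
  group = rep(0:1, each = 5),
  test1 = c(57, 77, 67, 15, 58, 18, 44, 18, 20, 14),
  test2 = c(82, 80, 90, 70, 72, 44, 44, 46, 44, 38)
)

# c() concatenates the columns, so the result has 2 * nrow(df) elements
stacked_length <- length(c(df$test1, df$test2))

# Hypothetical helper: clamp values to the given lower and upper quantiles
winsorize_q <- function(x, probs = c(0.05, 0.95)) {
  limits <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, limits[1]), limits[2])
}

# Winsorize each listed column separately within each group
vars <- c("test1", "test2")
df[paste0(vars, "_wins")] <- lapply(vars, function(v) {
  ave(df[[v]], df$group, FUN = winsorize_q)
})
```

Adding more variables then only requires extending `vars`; the grouping and the helper stay the same.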
Answer 1 (score: 0)
You can write a version of winsorize() that works on data frames and use it together with by():
# Example data
set.seed(1)
df2 <- round(matrix(rt(100, 4), 20), 3)
df2 <- data.frame(id=seq_len(nrow(df2)),
group=sort(rep(1:2, length=nrow(df2))),
test=df2)
df2[c(1:3, 11:13),]
# id group test.1 test.2 test.3 test.4 test.5
# 1 1 1 -0.673 -1.227 0.015 -0.831 0.024
# 2 2 1 -0.584 1.059 1.492 0.833 -0.377
# 3 3 1 0.572 0.613 -1.924 -0.672 1.184
# 11 11 2 0.054 0.020 2.241 -0.103 -0.047
# 12 12 2 1.746 -0.788 -0.268 -1.921 4.577
# 13 13 2 -0.472 -1.294 -0.258 0.795 -1.110
# data frame version of winsorize
winsorizedf <- function(x, ...) {
  do.call(cbind, lapply(x, winsorize, ...))
}
# winsorize every column, except the two first ones, grouped by df2$group
w <- do.call(rbind,
             by(df2[, -(1:2)], df2$group, winsorizedf))
# combine the winsorized columns with the original id and group columns
dfw <- data.frame(df2[, 1:2], round(w, 2))
dfw[c(1:3, 11:13),]
# id group test.1 test.2 test.3 test.4 test.5
# 1 1 1 -0.63 -1.23 0.02 -0.83 0.02
# 2 2 1 -0.58 1.06 1.49 0.26 -0.38
# 3 3 1 0.57 0.61 -1.60 -0.67 1.18
# 11 11 2 0.05 0.02 1.23 -0.10 -0.05
# 12 12 2 1.70 -0.79 -0.27 -1.92 4.58
# 13 13 2 -0.47 -1.07 -0.26 0.80 -1.11
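One thing to keep in mind with the by() and rbind() approach is that the results are stacked group by group, so they line up with the id and group columns only when the data are already sorted by group, as they are here. A minimal sketch of an order-preserving variant uses split() and unsplit() instead. The names `clamp_mad`, `clamp_df`, and `toy` are hypothetical; `clamp_mad` is a plain median ± k × MAD clamp used as a stand-in so the block runs without robustHD.

```r
# Hypothetical stand-in: clamp values to median +/- k * MAD
clamp_mad <- function(x, k = 2) {
  centre <- median(x)
  spread <- mad(x)
  pmin(pmax(x, centre - k * spread), centre + k * spread)
}

# Data frame version: apply the helper to every column, keep the data frame shape
clamp_df <- function(d, ...) {
  d[] <- lapply(d, clamp_mad, ...)
  d
}

# Toy data whose rows are deliberately NOT sorted by group
set.seed(1)
toy <- data.frame(
  id    = 1:8,
  group = c(2, 1, 2, 1, 1, 2, 1, 2),
  a     = rnorm(8),
  b     = rnorm(8)
)

# Split the measurement columns by group, clamp each piece,
# then put the rows back in their original order
pieces <- split(toy[, c("a", "b")], toy$group)
wins   <- unsplit(lapply(pieces, clamp_df), toy$group)
result <- cbind(toy[, c("id", "group")], wins)
```

Because unsplit() reverses the split using the same grouping vector, each row of `result` still corresponds to the row of `toy` with the same id.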