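I want to take the mean of each row of my data and find how far each value in that row is from the mean. If the percentage is higher than 50, that value should be replaced with NA.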
Here is the data:
df1 <- structure(list(Name = structure(c(18L, 19L, 5L, 13L, 14L, 31L
), .Label = c("AMC Javelin", "Cadillac Fleetwood", "Camaro Z28",
"Chrysler Imperial", "Datsun 710", "Dodge Challenger", "Duster 360",
"Ferrari Dino", "Fiat 128", "Fiat X1-9", "Ford Pantera L", "Honda Civic",
"Hornet 4 Drive", "Hornet Sportabout", "Lincoln Continental",
"Lotus Europa", "Maserati Bora", "Mazda RX4", "Mazda RX4 Wag",
"Merc 230", "Merc 240D", "Merc 280", "Merc 280C", "Merc 450SE",
"Merc 450SL", "Merc 450SLC", "Pontiac Firebird", "Porsche 914-2",
"Toyota Corolla", "Toyota Corona", "Valiant", "Volvo 142E"), class = "factor"),
mpg_1 = c(125, 133, 143, 141, 134, 238), cyl_1 = c(114, 153,
112, 136, 128, 155), disp_1 = c(113, 143, 144, 131, 431,
331), hp_1 = c(332, 221, 113, 331, 134, 151)), .Names = c("Name",
"mpg_1", "cyl_1", "disp_1", "hp_1"), row.names = c(NA, 6L), class = "data.frame")
This is the desired output:
               Name mpg_1 cyl_1 disp_1 hp_1
1         Mazda RX4   125   114    113   NA
2     Mazda RX4 Wag   133   153    143  221
3        Datsun 710   143   112    144  113
4    Hornet 4 Drive   141   136    131   NA
5 Hornet Sportabout   134   128     NA  134
6           Valiant   238   155    331  151
There are also two conditions.
NA
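It is hard to believe that a 50% cutoff would ever flag two values in one row, since the mean would shift completely, but see the second condition. Do you know an efficient way to do this? A loop looks like it would work, but maybe there is a more efficient approach?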
Answer 0: (score: 3)
From a statistical point of view, as @Roland mentioned in the comments, this is not recommended. However, if you have to do it, then:
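# Set to NA the first value in x that lies more than a fraction n above the mean of x.
fun1 <- function(x, n){
  m <- mean(x)
  # which(...)[1] is NA when nothing qualifies, so x is then returned unchanged
  idx <- which((x - m)/m > n)[1]
  x[idx] <- NA
  return(x)
}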
df1[-1] <- t(apply(df1[-1], 1, fun1, 0.5))
df1
#               Name mpg_1 cyl_1 disp_1 hp_1
#1         Mazda RX4   125   114    113   NA
#2     Mazda RX4 Wag   133   153    143  221
#3        Datsun 710   143   112    144  113
#4    Hornet 4 Drive   141   136    131   NA
#5 Hornet Sportabout   134   128     NA  134
#6           Valiant   238   155     NA  151
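Since the question asks about avoiding a loop, here is a minimal vectorized sketch of the same rule, not part of the original answer. It assumes the numeric columns are everything except Name, rebuilds the sample data by hand (with Name as plain character rather than a factor), and uses the helper names m, mu, hit, rows and first purely for illustration. Like fun1, it flags only values above the row mean and blanks only the first flagged column per row. On this data it also turns disp_1 in row 6 into NA, because 331 is about 51% above that row's mean of 218.75.

```r
# Sample data from the question, rebuilt by hand (assumption: Name as character).
df1 <- data.frame(
  Name   = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
             "Hornet 4 Drive", "Hornet Sportabout", "Valiant"),
  mpg_1  = c(125, 133, 143, 141, 134, 238),
  cyl_1  = c(114, 153, 112, 136, 128, 155),
  disp_1 = c(113, 143, 144, 131, 431, 331),
  hp_1   = c(332, 221, 113, 331, 134, 151)
)

m  <- as.matrix(df1[-1])   # numeric columns only
mu <- rowMeans(m)          # one mean per row

# Recycling runs down the columns, so mu[i] is matched with every value in row i.
hit <- (m - mu) / mu > 0.5

# Keep only the first flagged column in each row, mirroring which(...)[1] in fun1.
rows  <- which(rowSums(hit) > 0)
first <- max.col(hit[rows, , drop = FALSE] * 1, ties.method = "first")
m[cbind(rows, first)] <- NA

df1[-1] <- m
df1
```

If every value above the cutoff should be blanked rather than just the first one, the last three statements before the assignment back into df1 can be replaced by m[hit] <- NA.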