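I am trying to set up an automatic filter to get rid of uninformative variables. I process my data with a single command that removes any variable (column) in which some value is repeated more than "x" times: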
df <- df[, which(apply(df, 2, function(col) !any(table(col) > x)))]
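To make the behaviour of that one-liner concrete, here is a minimal sketch on a made-up data frame. The names toy, keep and toy_filtered and the threshold of 3 are hypothetical and only for illustration; they are not part of my real data.

```r
# Hypothetical toy data: column b repeats the value 7 four times
toy <- data.frame(a = c(1, 2, 3, 4, 5),
                  b = c(7, 7, 7, 7, 1),
                  c = c(0, 0, 1, 1, 2))
x <- 3  # threshold: drop a column if any value occurs more than x times

# table(col) counts each distinct value in a column;
# keep the column only if no count exceeds x
keep <- apply(toy, 2, function(col) !any(table(col) > x))
toy_filtered <- toy[, which(keep), drop = FALSE]
names(toy_filtered)
```

Only column b has a value that occurs more than 3 times, so it is the one that gets dropped.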
I am now trying to apply the same thing, but across 2 levels. This is what my data looks like:
df <- structure(list(V1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 0, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20), V2 = c(2, 2, 2, 2, 2, 2, 2,
2, 0, 0, 0, 2, 2, 7, 2, 3, 4, 6, 4, 5, 2), V3 = c(0, 0, 0, 0,
0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1), level = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor")), .Names = c("V1",
"V2", "V3", "level"), row.names = c(NA, 21L), class = "data.frame")
I want to remove any variable that repeats the same value more than x times (5 in this example) in both levels A and B. My desired output is:
df2 <- structure(list(V1 = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L,
0L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L), V2 = c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 0L, 0L, 0L, 2L, 5L, 7L, 2L, 3L, 4L,
6L, 4L, 5L, 2L), level = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A",
"B"), class = "factor")), .Names = c("V1", "V2", "level"), class = "data.frame", row.names = c(NA,
-21L))
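I considered using subset() to split the data by level, running my previous command on each part and joining the parts again, but that seems like a long way around. I cannot think of a suitable command to do the job. Any ideas for a shorter command that would do this?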
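For reference, the split-and-rejoin idea could look roughly like the sketch below. It is only an illustration under assumptions: the data frame is rebuilt with data.frame() so the snippet runs on its own, and the names vars, flagged, drop and df_out are hypothetical. Note that it flags a variable when any value exceeds x within each level, not necessarily the same value in both levels.

```r
# Same data as above, rebuilt compactly so the snippet is self-contained
df <- data.frame(
  V1 = c(1:9, 1, 0, 11:20),
  V2 = c(rep(2, 8), 0, 0, 0, 2, 2, 7, 2, 3, 4, 6, 4, 5, 2),
  V3 = c(rep(0, 6), rep(1, 4), rep(0, 8), rep(1, 3)),
  level = factor(rep(c("A", "B"), times = c(10, 11)))
)
x <- 5
vars <- setdiff(names(df), "level")

# For each level, flag the variables in which some value occurs more than x times
# (result: a logical matrix with one row per variable and one column per level)
flagged <- sapply(split(df[vars], df$level), function(part)
  sapply(part, function(col) any(table(col) > x)))

# Drop a variable only when it is flagged in every level
drop <- apply(flagged, 1, all)
df_out <- df[, c(vars[!drop], "level")]
names(df_out)
```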
Thanks,
Answer 0 (score: 2)
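Use table() on two columns (the variable and level) to get a two-way table, then use apply() to check whether any row of the resulting table is all TRUE (i.e. that value appears more than x times in every level):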
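# Two-way tables (value by level) for each variable; TRUE where a count exceeds 5
lens <- lapply( df[ , -ncol(df) ] , function(x) table( x , df$level ) > 5 )
# Keep a column only if NO value exceeds the threshold in ALL levels
ind <- sapply( lens , function(x) ! any( apply( x , 1 , all ) ) )
# Subset; ind only covers V1..V3, so keep the level column explicitly
# instead of relying on recycling of the logical index
df <- df[ , c( ind , TRUE ) ]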
head( df )
  V1 V2 level
1  1  2     A
2  2  2     A
3  3  2     A
4  4  2     A
5  5  2     A
6  6  2     A
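If the threshold or the grouping column needs to vary, the same idea can be wrapped in a small function. This is only a sketch under assumptions: drop_repeated and its arguments group and x are hypothetical names, and the data frame is rebuilt with data.frame() so that the snippet runs on its own.

```r
# Same data as in the question, rebuilt compactly
df <- data.frame(
  V1 = c(1:9, 1, 0, 11:20),
  V2 = c(rep(2, 8), 0, 0, 0, 2, 2, 7, 2, 3, 4, 6, 4, 5, 2),
  V3 = c(rep(0, 6), rep(1, 4), rep(0, 8), rep(1, 3)),
  level = factor(rep(c("A", "B"), times = c(10, 11)))
)

# Hypothetical helper: drop every column in which some value occurs
# more than x times in every level of the grouping column
drop_repeated <- function(data, group = "level", x = 5) {
  vars <- setdiff(names(data), group)
  keep <- sapply(data[vars], function(col) {
    over <- table(col, data[[group]]) > x  # value-by-level logical table
    !any(apply(over, 1, all))              # TRUE if no value exceeds x in all levels
  })
  data[, c(vars[keep], group), drop = FALSE]
}

# Two-way count table for V3: the value 0 occurs 6 times in A and 8 times in B,
# which is why V3 is dropped with x = 5
table(df$V3, df$level)

res <- drop_repeated(df, group = "level", x = 5)
names(res)
```

Selecting the columns by name keeps the grouping column in the result regardless of where it sits in the data frame.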