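For example, I have a data frame with two factor variables and 1000 rows. I want to reduce the number of observations to 200 by returning, for every block of 5 rows, the most frequently occurring level.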
I would like the output to have two columns, one per variable. The example data look like this:
df <- data.frame(test=factor(sample(c("A","B", "C" ),1000,replace=TRUE)))
df$test2 <- factor(sample(c("dog", "cat", "fish"), 1000, replace=TRUE))
head(df, 15)
   test test2
1     C  fish
2     B   dog
3     A  fish
4     B  fish
5     B   dog
6     A   cat
7     B   cat
8     C  fish
9     C  fish
10    C   cat
11    B   dog
12    A  fish
13    B   dog
14    B   cat
15    C   dog
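I have found some examples where the most common category is found across the columns within a row, but none that work down a column over blocks of rows. Thanks in advance for any suggestions.
Answer 0 (score: 0)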
We can try data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)), then count the rows (.N) grouped by 'test', 'test2', and a grouping variable ('grp') created by repeating each value of 1:200 five times. Next, grouped by 'grp', take the subset of the data.table (.SD) where 'N' is at its maximum (which.max(N)). If needed, the 'grp' and 'N' columns can be assigned to NULL to drop them.
library(data.table)
res <- setDT(df)[, .N, by = .(test, test2, grp = rep(1:200, each = 5))
][, .SD[which.max(N)], by = grp][, c("grp", "N") := NULL][]
dim(res)
#[1] 200 2
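Two details of the approach above may be worth checking on a small scale: how the block index is built, and how ties are resolved. The following is a minimal base R sketch, not part of the original answer; the row count `n` and the `counts` vector are made-up values for illustration. It assumes that when the number of rows is not a multiple of 5, the last block should simply be shorter.

```r
# Hypothetical row count that is not a multiple of 5
n <- 12

# Integer division puts rows 1-5 in block 1, rows 6-10 in block 2, and so on;
# unlike rep(1:k, each = 5), this does not require n to be a multiple of 5
grp <- (seq_len(n) - 1) %/% 5 + 1
grp

# which.max() returns the position of the FIRST maximum,
# so a tie goes to whichever entry comes first
counts <- c(A = 2, B = 2, C = 1)   # made-up counts with a tie between A and B
winner <- names(counts)[which.max(counts)]
winner
```

With only 5 rows per block, ties are likely, so the reported level can depend on the order in which the candidates appear.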
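Since the OP did not call set.seed before creating the sample, the output will differ from run to run. Using the first 15 rows shown in the OP's post (stored as df1):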
setnames(setDT(df1)[, .N, by = .(test, test2, grp= rep(1:3, each = 5))
][, .SD[which.max(N)] , grp][, c("grp", "N") := NULL][], paste0(names(df1), "ANS"))[]
#   testANS test2ANS
#1:       B      dog
#2:       C     fish
#3:       B      dog
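To make the full 1000-row example reproducible, the data from the question could be generated after fixing the random seed. This is a sketch only: the seed value 42 and the name `df_seeded` are arbitrary choices, not something the OP used.

```r
# Fix the random number generator state so sample() gives the same draws each run
set.seed(42)  # arbitrary seed, chosen only for illustration

df_seeded <- data.frame(
  test  = factor(sample(c("A", "B", "C"), 1000, replace = TRUE)),
  test2 = factor(sample(c("dog", "cat", "fish"), 1000, replace = TRUE))
)

dim(df_seeded)
```

Anyone running this with the same seed and the same R sampling defaults should get the same data frame, which makes posted results comparable.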
Based on the comments, it appears the frequency should be computed separately for each column:
setDT(df1)[, grp:= rep(1:3, each = 5)][,
testN := .N ,by = .(grp, test)][, test2N := .N, by = .(grp, test2)
][, .(testANS = test[which.max(testN)], test2ANS = test2[which.max(test2N)]), grp]
#   grp testANS test2ANS
#1:   1       B     fish
#2:   2       C      cat
#3:   3       B      dog
Note: for the original dataset, change rep(1:3, each = 5) to rep(1:200, each = 5).
Data:
df1 <- structure(list(test = c("C", "B", "A", "B", "B", "A", "B", "C",
"C", "C", "B", "A", "B", "B", "C"), test2 = c("fish", "dog",
"fish", "fish", "dog", "cat", "cat", "fish", "fish", "cat", "dog",
"fish", "dog", "cat", "dog")), .Names = c("test", "test2"),
class = "data.frame", row.names = c(NA, -15L))
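For comparison, the per-column version can also be sketched in base R without data.table. This is an alternative sketch, not the answer's method: `rows15` and `mode_level` are hypothetical names introduced here, and `rows15` just repeats the 15 example rows so the block runs on its own. Note that table() orders levels alphabetically, so ties would be broken alphabetically rather than by first appearance; the example rows contain no ties.

```r
# The 15 example rows, copied under a new name so the block is self-contained
rows15 <- data.frame(
  test  = c("C", "B", "A", "B", "B", "A", "B", "C", "C", "C",
            "B", "A", "B", "B", "C"),
  test2 = c("fish", "dog", "fish", "fish", "dog", "cat", "cat", "fish",
            "fish", "cat", "dog", "fish", "dog", "cat", "dog"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent value in a vector
# (on a tie, the alphabetically first value wins)
mode_level <- function(x) {
  tab <- table(x)
  names(tab)[which.max(tab)]
}

# Block index: 1 for rows 1-5, 2 for rows 6-10, 3 for rows 11-15
n_blocks <- nrow(rows15) / 5
grp <- rep(seq_len(n_blocks), each = 5)

# Apply the helper to each block of each column independently
res <- data.frame(
  grp      = seq_len(n_blocks),
  testANS  = unname(vapply(split(rows15$test,  grp), mode_level, character(1))),
  test2ANS = unname(vapply(split(rows15$test2, grp), mode_level, character(1))),
  stringsAsFactors = FALSE
)
res
```

On these rows the result matches the per-column data.table output above: B, C, B for test and fish, cat, dog for test2.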