I have the following dataset:
structure(list(SERIAL = c(118694001L, 118694001L, 118694001L,
118695001L, 118696001L, 118696001L, 118696001L, 118697001L, 118698001L,
118698001L, 118699001L, 118699001L, 118699001L, 118700001L, 118700001L,
118701001L, 118701001L), RELATED = c(9999L, 9999L, 9999L, 3100L,
3100L, 3100L, 3100L, 3100L, 3100L, 3100L, 9999L, 9999L, 9999L,
3100L, 3100L, 3100L, 3100L)), class = "data.frame", row.names = c(NA,
-17L))
I want to create a new column, Count, that counts how many times the value 3100 occurs in the RELATED column, grouped by SERIAL.
I tried:
df <- within(data, DILs2 <- ave(SERIAL, list(SERIAL, RELATED == 3100), FUN=length))
The result should look like this:
SERIAL RELATED Count
118694001 9999 0
118694001 9999 0
118694001 9999 0
118695001 3100 1
118696001 3100 3
118696001 3100 3
118696001 3100 3
118697001 3100 1
118698001 3100 2
118698001 3100 2
118699001 9999 0
118699001 9999 0
118699001 9999 0
118700001 3100 2
118700001 3100 2
118701001 3100 2
118701001 3100 2
Answer 0 (score: 2)
If you need to look for several values in RELATED, it is better to use group_by(SERIAL, RELATED) followed by mutate(count = n()).
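As a minimal sketch of that suggestion, the block below rebuilds the question's data (the `rep()` shorthand and the name `counts` are my own, not from the original post) and counts rows for every SERIAL/RELATED pair while keeping all 17 rows.

```r
library(dplyr)

# Sample data rebuilt from the question (SERIAL values step by 1000)
df <- data.frame(
  SERIAL  = rep(118694001L + 1000L * 0:7, times = c(3, 1, 3, 1, 2, 3, 2, 2)),
  RELATED = rep(c(9999L, 3100L, 9999L, 3100L), times = c(3, 7, 3, 4))
)

# n() is the size of the current group, so every row receives the number
# of rows that share its SERIAL and RELATED values
counts <- df %>%
  group_by(SERIAL, RELATED) %>%
  mutate(count = n()) %>%
  ungroup()
```

Note that this counts every RELATED value, so the 9999 rows get their own group size rather than 0; it answers the "multiple values" case, not the exact output requested in the question.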
The following code may help you get started. With dplyr you can do:
library(dplyr)
df %>%
group_by(SERIAL) %>%
summarise(count = sum(RELATED == 3100))
# A tibble: 8 x 2
SERIAL count
<int> <int>
1 118694001 0
2 118695001 1
3 118696001 3
4 118697001 1
5 118698001 2
6 118699001 0
7 118700001 2
8 118701001 2
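The summarise() call above returns one row per SERIAL, whereas the question asks for a column on every row. A possible variant, sketched here under the assumption that swapping summarise() for mutate() is acceptable (the name `with_count` is hypothetical), keeps all 17 rows:

```r
library(dplyr)

# Sample data rebuilt from the question (SERIAL values step by 1000)
df <- data.frame(
  SERIAL  = rep(118694001L + 1000L * 0:7, times = c(3, 1, 3, 1, 2, 3, 2, 2)),
  RELATED = rep(c(9999L, 3100L, 9999L, 3100L), times = c(3, 7, 3, 4))
)

# mutate() keeps every row; sum() over the logical vector counts the
# rows in each SERIAL group where RELATED equals 3100
with_count <- df %>%
  group_by(SERIAL) %>%
  mutate(Count = sum(RELATED == 3100)) %>%
  ungroup()
```

For this data the Count column should match the desired output in the question, since no SERIAL mixes 3100 with other values.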
Or in data.table:
library(data.table)
setDT(df)[, .(count = sum(RELATED == 3100)), SERIAL]
SERIAL count
1: 118694001 0
2: 118695001 1
3: 118696001 3
4: 118697001 1
5: 118698001 2
6: 118699001 0
7: 118700001 2
8: 118701001 2
Or in base R using aggregate:
aggregate(RELATED ~ SERIAL, data=df, function(x) {sum(x == 3100)})
SERIAL RELATED
1 118694001 0
2 118695001 1
3 118696001 3
4 118697001 1
5 118698001 2
6 118699001 0
7 118700001 2
8 118701001 2
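aggregate() also collapses the data to one row per SERIAL. If a per-row column is wanted in base R, one option is ave(), which the question already attempted. The sketch below is my own reworking of that attempt, not part of the original answer: it sums an indicator of RELATED == 3100 within each SERIAL.

```r
# Sample data rebuilt from the question (SERIAL values step by 1000)
df <- data.frame(
  SERIAL  = rep(118694001L + 1000L * 0:7, times = c(3, 1, 3, 1, 2, 3, 2, 2)),
  RELATED = rep(c(9999L, 3100L, 9999L, 3100L), times = c(3, 7, 3, 4))
)

# ave() applies FUN within each SERIAL group and returns a vector the same
# length as its input, so the result can be assigned directly as a column
df$Count <- ave(as.integer(df$RELATED == 3100), df$SERIAL, FUN = sum)
```

The attempt in the question grouped by both SERIAL and the condition and used length, which gives the group size even for rows where RELATED is not 3100; summing the indicator avoids that.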
Answer 1 (score: 1)
Wrap table() in data.frame(). It takes a single line of code.
> data.frame(table(df$SERIAL,df$RELATED))
Var1 Var2 Freq
1 118694001 3100 0
2 118695001 3100 1
3 118696001 3100 3
4 118697001 3100 1
5 118698001 3100 2
6 118699001 3100 0
7 118700001 3100 2
8 118701001 3100 2
9 118694001 9999 3
10 118695001 9999 0
11 118696001 9999 0
12 118697001 9999 0
13 118698001 9999 0
14 118699001 9999 3
15 118700001 9999 0
16 118701001 9999 0
The rest is cosmetic.
Hope this helps.
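One way the remaining cosmetic step might look, as a sketch only (the name `tab` and the lookup by row name are my assumptions, not part of the answer): keep just the "3100" column of the table and map it back onto the rows of df.

```r
# Sample data rebuilt from the question (SERIAL values step by 1000)
df <- data.frame(
  SERIAL  = rep(118694001L + 1000L * 0:7, times = c(3, 1, 3, 1, 2, 3, 2, 2)),
  RELATED = rep(c(9999L, 3100L, 9999L, 3100L), times = c(3, 7, 3, 4))
)

# Cross-tabulate SERIAL against RELATED; dimnames are character labels
tab <- table(df$SERIAL, df$RELATED)

# Look up each row's SERIAL in the "3100" column of the table
df$Count <- as.vector(tab[as.character(df$SERIAL), "3100"])
```

This relies on "3100" actually occurring in RELATED; if it never did, the column would not exist in the table and the lookup would fail.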
Answer 2 (score: 1)
You can also do it this way:
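library(data.table)
# Copy the question's data frame df into a data.table named dt
dt <- as.data.table(df)
# Number of rows in each SERIAL group
dt[,count:=.N,by=c("SERIAL")]
# Zero the count on rows where RELATED is not 3100
dt[,count:=ifelse(RELATED!=3100,0,count)]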
> dt
SERIAL RELATED count
1: 118694001 9999 0
2: 118694001 9999 0
3: 118694001 9999 0
4: 118695001 3100 1
5: 118696001 3100 3
6: 118696001 3100 3
7: 118696001 3100 3
8: 118697001 3100 1
9: 118698001 3100 2
10: 118698001 3100 2
11: 118699001 9999 0
12: 118699001 9999 0
13: 118699001 9999 0
14: 118700001 3100 2
15: 118700001 3100 2
16: 118701001 3100 2
17: 118701001 3100 2
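The two-step approach above gives the group size, which equals the number of 3100 rows only when a SERIAL contains nothing but 3100. A single-step alternative, offered as a sketch rather than as the answerer's method, counts the matches directly:

```r
library(data.table)

# Sample data rebuilt from the question (SERIAL values step by 1000)
df <- data.frame(
  SERIAL  = rep(118694001L + 1000L * 0:7, times = c(3, 1, 3, 1, 2, 3, 2, 2)),
  RELATED = rep(c(9999L, 3100L, 9999L, 3100L), times = c(3, 7, 3, 4))
)

dt <- as.data.table(df)

# := adds the column by reference; the sum is recycled to every row
# of the SERIAL group it was computed for
dt[, count := sum(RELATED == 3100), by = SERIAL]
```

For the question's data both versions should agree. They would differ for a SERIAL that mixes 3100 with other values: here every row of that group gets the number of 3100 rows, while the two-step version gives the full group size on 3100 rows and 0 elsewhere.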