I have a dataset that looks like this, where one column can take four different values:
dataset <- data.frame(out = c("a","b","c","a","d","b","c","a","d","b","c","a"))
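In R, I want to create a second column that counts, in row order, how many times each value has appeared so far. The output column would look like this: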
out
1
1
1
2
1
2
2
3
2
3
3
4
Answer 0: (score: 14)
Try this:
dataset <- data.frame(out = c("a","b","c","a","d","b","c","a","d","b","c","a"))
with(dataset, ave(as.character(out), out, FUN = seq_along))
# [1] "1" "1" "1" "2" "1" "2" "2" "3" "2" "3" "3" "4"
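Of course, you can assign the output to a column in your data.frame. Wrapping the call in as.numeric converts the character result of ave to numbers:
dataset$asNumbers <- as.numeric(with(dataset, ave(as.character(out), out, FUN = seq_along)))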
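The ave call above returns character values because its first argument is a character vector. A minimal sketch of one way around that, assuming you are free to pass something other than the out column as the first argument: give ave an integer vector instead, so the result is already an integer. The column name count is only illustrative.

```r
dataset <- data.frame(out = c("a","b","c","a","d","b","c","a","d","b","c","a"))

# ave() keeps the type of its first argument, so passing an integer
# vector (here the row positions) yields integer counts directly.
dataset$count <- ave(seq_along(dataset$out), dataset$out, FUN = seq_along)

dataset$count
```

With this form no as.numeric wrapper is needed, since seq_along already produces integers within each group.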
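The "dplyr" approach is also quite nice. The logic is very similar to the "data.table" approach. One advantage is that you don't need to wrap the output in as.numeric, which the ave approach mentioned above requires.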
library(dplyr)
dataset %>% group_by(out) %>% mutate(count = sequence(n()))
# Source: local data frame [12 x 2]
# Groups: out
#
# out count
# 1 a 1
# 2 b 1
# 3 c 1
# 4 a 2
# 5 d 1
# 6 b 2
# 7 c 2
# 8 a 3
# 9 d 2
# 10 b 3
# 11 c 3
# 12 a 4
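As a variation on the grouped mutate above, dplyr also provides row_number(), which numbers the rows within each group. The sketch below assumes dplyr is installed; the name result and the trailing ungroup() are illustrative choices, not part of the original answer.

```r
library(dplyr)

dataset <- data.frame(out = c("a","b","c","a","d","b","c","a","d","b","c","a"))

# row_number() inside a grouped mutate() numbers rows within each group,
# in their original order; ungroup() drops the grouping afterwards.
result <- dataset %>%
  group_by(out) %>%
  mutate(count = row_number()) %>%
  ungroup()

result$count
```

Calling ungroup() at the end is optional, but it avoids carrying the grouping into later steps by accident.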
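A third option is to use getanID from my "splitstackshape" package. For this particular example, you only need to specify the data.frame name (since it has just one column). In general, though, you would be more specific and name the columns that currently serve as "ids"; the function then checks whether they are unique or whether a cumulative sequence is needed to make them unique.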
library(splitstackshape)
# getanID(dataset, "out") ## Example of being specific about column to use
getanID(dataset)
# out .id
# 1: a 1
# 2: b 1
# 3: c 1
# 4: a 2
# 5: d 1
# 6: b 2
# 7: c 2
# 8: a 3
# 9: d 2
# 10: b 3
# 11: c 3
# 12: a 4
Answer 1: (score: 7)
As Ananda pointed out, you can use the simpler:
DT[, counts := sequence(.N), by = "V1"]
(with DT defined as shown below).
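Alternatively, you can create a "counts" column initialized to 1, and then take the cumulative sum within each factor level.
Here is how to do that with data.table: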
# Called the column V1
dataset<-data.frame(V1=c("a","b","c","a","d","b","c","a","d","b","c","a"))
library(data.table)
DT <- data.table(dataset)
DT[, counts := 1L]
DT[, counts := cumsum(counts), by=V1]; DT
# V1 counts
# 1: a 1
# 2: b 1
# 3: c 1
# 4: a 2
# 5: d 1
# 6: b 2
# 7: c 2
# 8: a 3
# 9: d 2
# 10: b 3
# 11: c 3
# 12: a 4
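For comparison, data.table also ships a helper, rowid(), that produces the same running count per value in one step. This is a sketch assuming a reasonably recent data.table version (rowid() was added in 1.9.8); the column name counts mirrors the one used above.

```r
library(data.table)

DT <- data.table(V1 = c("a","b","c","a","d","b","c","a","d","b","c","a"))

# rowid() returns, for each row, how many times its value has
# occurred so far, so no explicit `by` is needed.
DT[, counts := rowid(V1)]

DT$counts
```

This gives the same column as the cumsum and sequence(.N) versions, without first creating a column of ones.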