Numeric sequence with condition

Time: 2015-10-06 09:02:02

Tags: r dataframe sequence dplyr sequences

I have a large data.frame to which I want to add a new column (called Seq) holding a sequential count that restarts every time the value in another column changes. Here is an example of the data.frame (with other columns omitted), including the new Seq column. As you can see, there is a sequential count, but every time there is a new IDPath the count restarts. The runs vary in length: some are 1 row long, while others are 300.

IDPath    LogTime               Seq
AADS      19-06-2015 01:57      1
AADS      19-06-2015 01:55      2
AADS      19-06-2015 01:54      3
AADS      19-06-2015 01:53      4
DHSD      19-06-2015 12:57      1
DHSD      19-06-2015 10:58      2
DHSD      19-06-2015 09:08      3
DHSD      19-06-2015 08:41      4

4 answers:

Answer 0 (score: 5)

Using the data.table package, here is how to get what you want:

require(data.table)
setDT(dt)[, Seq:=1:.N, by=IDPath]
# or, as mentioned by @DavidArenburg
setDT(dt)[, Seq:=seq_len(.N), by=IDPath]

dt
#   IDPath          LogTime Seq
#1:   AADS 19-06-2015 01:57   1
#2:   AADS 19-06-2015 01:55   2
#3:   AADS 19-06-2015 01:54   3
#4:   AADS 19-06-2015 01:53   4
#5:   DHSD 19-06-2015 12:57   1
#6:   DHSD 19-06-2015 10:58   2
#7:   DHSD 19-06-2015 09:08   3
#8:   DHSD 19-06-2015 08:41   4

Answer 1 (score: 4)

You can also use the rleid function from the data.table package, which is designed to generate run-length type id columns in grouped operations:

library(data.table)
setDT(df)[, Seq := rleid(LogTime), by=IDPath]

which gives:

> df
   IDPath          LogTime Seq
1:   AADS 19-06-2015:01:57   1
2:   AADS 19-06-2015:01:55   2
3:   AADS 19-06-2015:01:54   3
4:   AADS 19-06-2015:01:53   4
5:   DHSD 19-06-2015:12:57   1
6:   DHSD 19-06-2015:10:58   2
7:   DHSD 19-06-2015:09:08   3
8:   DHSD 19-06-2015:08:41   4

Another option is to use the rowid function:

setDT(df)[, Seq := rowid(IDPath)]
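
The two approaches differ when a group contains repeated consecutive LogTime values: rleid only increments when the value changes, while rowid counts rows. A minimal sketch of that difference, assuming a small hypothetical data set (the column names SeqRleid and SeqRowid are made up for illustration):

```r
library(data.table)

# Hypothetical data: the first two rows share the same LogTime
df <- data.table(
  IDPath  = c("AADS", "AADS", "AADS", "DHSD"),
  LogTime = c("19-06-2015 01:57", "19-06-2015 01:57",
              "19-06-2015 01:54", "19-06-2015 12:57")
)

# rleid() numbers runs of identical consecutive values within each group
df[, SeqRleid := rleid(LogTime), by = IDPath]

# rowid() numbers the rows belonging to each IDPath value
df[, SeqRowid := rowid(IDPath)]

df$SeqRleid  # 1 1 2 1
df$SeqRowid  # 1 2 3 1
```

If LogTime is unique within each IDPath, as in the question's example, both give the same result.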

Answer 2 (score: 3)

The obligatory Hadleyverse answer (a base R answer follows the Hadleyverse one):

library(dplyr)

dat <- read.table(text="IDPath    LogTime 
AADS      '19-06-2015 01:57'      
AADS      '19-06-2015 01:55'    
AADS      '19-06-2015 01:54'      
AADS      '19-06-2015 01:53'      
DHSD      '19-06-2015 12:57'      
DHSD      '19-06-2015 10:58'      
DHSD      '19-06-2015 09:08'      
DHSD      '19-06-2015 08:41'      ", header=TRUE, stringsAsFactors=FALSE, quote="'")

mutate(group_by(dat, IDPath), Seq=1:n())

or (via David Arenburg)

mutate(group_by(dat, IDPath), Seq=row_number())

or, if you are piping:

dat %>%
  group_by(IDPath) %>%
  mutate(Seq=1:n())

or (via David Arenburg)

dat %>%
  group_by(IDPath) %>%
  mutate(Seq=row_number())

The obligatory base R answer:

unsplit(lapply(split(dat, dat$IDPath), transform, Seq=1:length(IDPath)), dat$IDPath)

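or, more idiomatically (again via David)

# seq_along(IDPath) as the first argument keeps the result an integer vector
with(dat, ave(seq_along(IDPath), IDPath, FUN = seq_along))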
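
One caveat with ave: it returns a vector of the same type as its first argument, so passing a character column there produces character counters. A minimal sketch of the difference, assuming a small hypothetical data frame (seq_chr and seq_int are illustrative names):

```r
dat <- data.frame(IDPath = c("AADS", "AADS", "DHSD", "DHSD", "DHSD"),
                  stringsAsFactors = FALSE)

# Character first argument: the counters are coerced to character
seq_chr <- with(dat, ave(IDPath, IDPath, FUN = seq_along))

# Integer first argument: the counters stay integer
seq_int <- with(dat, ave(seq_along(IDPath), IDPath, FUN = seq_along))

class(seq_chr)  # "character"
class(seq_int)  # "integer"

# Store the integer version as the new column
dat$Seq <- seq_int
```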

If it really is a huge data frame, you may want to start the dplyr solution with tbl_dt(dat), but if you are already using data.table, CathG's or Jaap's version will be faster.

Answer 3 (score: 1)

This may be a bit verbose, but it is simple:

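alphabets <- c("a", "a", "b", "c", "c")
df <- data.frame(alphabets)
a <- table(df$alphabets)   # number of rows per group
df$seq <- NA_integer_      # preallocate the new column
k <- 1L                    # current row of df

# Assumes the rows of df are already sorted by group, in the order used by table()
for (i in seq_along(a)) {
  l <- 1L                  # the counter restarts for each group
  for (j in seq_len(a[i])) {
    df$seq[k] <- l
    k <- k + 1L
    l <- l + 1L
  }
}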

df
#  alphabets seq
#1         a   1
#2         a   2
#3         b   1
#4         c   1
#5         c   2
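
The same counter can be built without explicit loops. This is a minimal sketch using base R's rle and sequence rather than the loop above. Unlike the loop, it restarts whenever the value changes from one row to the next, so it does not depend on table()'s ordering:

```r
alphabets <- c("a", "a", "b", "c", "c")
df <- data.frame(alphabets, stringsAsFactors = FALSE)

# rle() gives the length of each run of identical consecutive values;
# sequence() expands each length n into 1..n and concatenates the results.
runs <- rle(df$alphabets)
df$seq <- sequence(runs$lengths)

df$seq  # 1 2 1 1 2
```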