How can I find the first and last observations in each group in R?

Asked: 2015-01-18 03:19:20

Tags: r

My dataset is as follows

dialled     Ringing     state   duration
NA  NA  NA  0
NA  NA  NA  0
NA  NA  NA  0
NA  NA  NA  0
123 NA  NA  0
123 NA  NA  0
123 NA  NA  0
123 NA  NA  60
NA  NA  active  0
NA  NA  active  0
NA  NA  inactive    0
NA  NA  inactive    0
NA  145 inactive    0
NA  145 inactive    0
NA  145 inactive    56
NA  NA  active  0
NA  NA  active  0
NA  NA  inactive    0
222 NA  inactive    0
222 NA  inactive    0
222 NA  inactive    37
NA  NA  active  0
NA  NA  active  0
NA  NA  inactive    0
123 NA  inactive    0
123 NA  inactive    0
123 NA  active  60
NA  NA  active  0

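I want to get the first and last rows for each dialled number (a repeated number counts separately, since each call is different). The answer I am looking for is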

dialled     Ringing     state   duration
123 NA  NA  0
123 NA  NA  60
222 NA  inactive    0
222 NA  inactive    37
123 NA  NA  0
123 NA  NA  60   

I used the following

library(plyr)
ddply(DF, .(dialled), function(x) x[c(1, nrow(x)), ])

which gave me

dialled     Ringing     state   duration
123 NA  NA  0
123 NA  NA  60
222 NA  inactive    0
222 NA  inactive    37

But that is not the correct answer. Please help.

New data


dialled     Ringing     state   duration
123 NA  NA  0
123 NA  NA  0
123 NA  NA  60
123 NA  NA  0
123 NA  NA  0
123 NA  NA  70
222 NA  inactive    0
222 NA  inactive    0
222 NA  inactive    37
123 NA  inactive    0
123 NA  inactive    0
123 NA  active  60


The expected answer is
dialled     Ringing     state   duration
123 NA  NA  0
123 NA  NA  60
123 NA  NA  0
123 NA  NA  70
222 NA  inactive    0
222 NA  inactive    37
123 NA  inactive    0
123 NA  active  60

2 answers:

Answer 0 (score: 3)

Here is an option with data.table_1.9.5. Create a "data.table" from the "data.frame" using setDT, remove the NA values in the "dialled" column (!is.na(dialled)), generate a grouping variable using rleid on "dialled", get the row indices of the first and last rows for each level of the grouping variable (.I[c(1L, .N)]), and finally subset "dt1" by those row indices.

library(data.table)
dt1 <- setDT(df)[!is.na(dialled)]
dt1[dt1[, .I[c(1L, .N)], rleid(dialled)]$V1]
#   dialled Ringing    state duration
#1:     123      NA       NA        0
#2:     123      NA       NA       60
#3:     222      NA inactive        0
#4:     222      NA inactive       37
#5:     123      NA inactive        0
#6:     123      NA   active       60

Or using base R

df1 <- df[!is.na(df$dialled), ]
grp <- inverse.rle(within.list(rle(df1$dialled),
                    values <- seq_along(values)))

df1[!duplicated(grp) | !duplicated(grp, fromLast = TRUE), ]
#    dialled Ringing    state duration
#5      123      NA     <NA>        0
#8      123      NA     <NA>       60
#19     222      NA inactive        0
#21     222      NA inactive       37
#25     123      NA inactive        0
#27     123      NA   active       60

Update

Based on the new dataset (here df holds the 12-row new data from the question), a new group starts after every row with a non-zero duration:

grp <- cumsum(c(TRUE, df$duration[-nrow(df)] != 0))
df[!duplicated(grp) | !duplicated(grp, fromLast = TRUE), ]
#   dialled Ringing    state duration
#1      123      NA     <NA>        0
#3      123      NA     <NA>       60
#4      123      NA     <NA>        0
#6      123      NA     <NA>       70
#7      222      NA inactive        0
#9      222      NA inactive       37
#10     123      NA inactive        0
#12     123      NA   active       60

Data

 df <- structure(list(dialled = c(NA, NA, NA, NA, 123L, 123L, 123L, 
 123L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 222L, 222L, 222L, 
 NA, NA, NA, 123L, 123L, 123L, NA), Ringing = c(NA, NA, NA, NA, 
 NA, NA, NA, NA, NA, NA, NA, NA, 145L, 145L, 145L, NA, NA, NA, 
 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), state = c(NA, NA, NA, 
 NA, NA, NA, NA, NA, "active", "active", "inactive", "inactive", 
 "inactive", "inactive", "inactive", "active", "active", "inactive", 
 "inactive", "inactive", "inactive", "active", "active", "inactive", 
 "inactive", "inactive", "active", "active"), duration = c(0L, 
 0L, 0L, 0L, 0L, 0L, 0L, 60L, 0L, 0L, 0L, 0L, 0L, 0L, 56L, 0L, 
 0L, 0L, 0L, 0L, 37L, 0L, 0L, 0L, 0L, 0L, 60L, 0L)), .Names = 
 c("dialled", "Ringing", "state", "duration"), class = "data.frame", 
 row.names = c(NA, -28L))
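
The key idea in this answer is that rows are grouped by consecutive run of dialled rather than by its value, which is why the two separate calls to 123 stay apart. A minimal base R sketch of such a run id follows; the helper name run_id and the toy vector x are made up for illustration, and the sketch assumes the NA rows have already been removed (rle() treats every NA as its own run).

```r
# Hypothetical helper: number each run of consecutive equal values
run_id <- function(x) {
  r <- rle(x)
  rep.int(seq_along(r$lengths), r$lengths)
}

# Toy data (an assumption, not the asker's data): 123 appears in two separate runs
x <- c(123L, 123L, 123L, 222L, 222L, 123L, 123L)

grp <- run_id(x)
grp
# 1 1 1 2 2 3 3

# Positions that are the first or the last element of their run
keep <- which(!duplicated(grp) | !duplicated(grp, fromLast = TRUE))
keep
# 1 3 4 5 6 7
```

Grouping by x itself would merge runs 1 and 3, which is what happens when grouping on the raw dialled column.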

Answer 1 (score: 2)

Here are two options. First, we need to set up a few things that will be used in both options.

## remove rows where 'dialled' is NA 
ndf <- DF[!is.na(DF$dialled),]
## run-length encoding on the 'dialled' column in 'ndf'
le <- rle(ndf$dialled)$lengths

Option 1: Create an integer vector of row numbers to subset with.

ndf[cumsum(mapply(c, 1L, le-1L)), ]
#    dialled Ringing    state duration
# 5      123      NA     <NA>        0
# 8      123      NA     <NA>       60
# 19     222      NA inactive        0
# 21     222      NA inactive       37
# 25     123      NA inactive        0
# 27     123      NA   active       60

If you don't want to loop, you can replace the mapply call with vec, defined as

vec <- replace(integer(2*length(le))+1L, c(FALSE, TRUE), le-1L)
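
To see why the two forms are interchangeable, here is a small sketch with the run lengths typed in by hand. The literal le = c(4L, 3L, 3L) is an assumption for illustration; it corresponds to runs of 123, 222 and 123 in the sample data once the NA rows are dropped.

```r
# Run lengths of 'dialled' after dropping NA rows: 123 x4, 222 x3, 123 x3
le <- c(4L, 3L, 3L)

# Looping version: one column c(1, length - 1) per run
steps_loop <- mapply(c, 1L, le - 1L)

# Vectorised version: a vector of 1s with every second slot set to length - 1
vec <- replace(integer(2 * length(le)) + 1L, c(FALSE, TRUE), le - 1L)
vec
# 1 3 1 2 1 2

# cumsum() turns the increments into row positions within ndf
idx <- cumsum(vec)
idx
# 1 4 5 7 8 10
```

Each pair in idx is the first and last row of one run. Note that a run of length 1 contributes an increment of 0, so its single row would be selected twice.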

Option 2: Add a helper id column. Then use dplyr functions to get the first and last rows based on the new id column.

library(dplyr)    
## updated data with new column
DF2 <- cbind(id = rep.int(seq_along(le), le), ndf)    
## group by id and filter on the first and last rows
slice(group_by(DF2, id), c(1, n()))
#   id dialled Ringing    state duration
# 1  1     123      NA       NA        0
# 2  1     123      NA       NA       60
# 3  2     222      NA inactive        0
# 4  2     222      NA inactive       37
# 5  3     123      NA inactive        0
# 6  3     123      NA   active       60

You can remove the helper column if you like, but it may also come in handy later.