Question

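I want to work out when someone started and stopped commenting on a blog. My dataset has a user ID in column 1, followed by 6 columns, each holding the number of comments in one time period. I want to create a column ('active') that identifies the first period in which a non-zero value appears. I also want to create a column ('inactive') that identifies the first period in which a zero follows a non-zero value and is followed only by zeros in all subsequent periods.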

Here are 10 rows of sample data:

structure(list(userid = c(199782L, 30982L, 27889L, 108358L, 29620L, 
229214L, 37531L, 711L, 30516L, 32360L), Period1 = c(0L, 1L, 43L, 
0L, 189L, 0L, 0L, 142L, 26L, 0L), Period2 = c(0L, 36L, 40L, 18L, 
32L, 0L, 6L, 55L, 159L, 0L), Period3 = c(0L, 68L, 25L, 110L, 
1L, 0L, 31L, 14L, 32L, 0L), Period4 = c(0L, 45L, 0L, 91L, 0L, 
0L, 54L, 1L, 0L, 0L), Period5 = c(93L, 27L, 57L, 0L, 0L, 35L, 
79L, 4L, 0L, 26L), Period6 = c(132L, 47L, 37L, 4L, 0L, 186L, 
50L, 2L, 0L, 191L)), .Names = c("userid", "Period1", "Period2", 
"Period3", "Period4", "Period5", "Period6"), row.names = 175:184, class = "data.frame")

The desired output for 5 of the rows is shown below. No value for 'inactive' means the user is still active.

userid, active, inactive
199782, 5
27889, 1
29620, 1, 3
37531, 2
30516, 1, 3

Can someone point me in the right direction on how to approach this? Thanks!

Answer 1

Using data.table for its syntactic sugar, and working by userid group (after reshaping the data to long format):

library(data.table)
melt(setDT(dat),id.vars='userid')[,
    list(active=min(which(value>0)),
         inactive={ mm = cumsum(value)
                    ## drop the leading zeros of the cumulative sum
                    mm = duplicated(mm[mm!=0])
                    ## NA_integer_ keeps both branches integer; otherwise data.table complains about types
                    ifelse(any(mm) && max(which(mm))==length(value),
                           min(which(mm)),NA_integer_)
         }),userid]

     userid active inactive
 1: 199782      5       NA
 2:  30982      1       NA
 3:  27889      1       NA
 4: 108358      2       NA
 5:  29620      1        4
 6: 229214      5       NA
 7:  37531      2       NA
 8:    711      1       NA
 9:  30516      1        4
10:  32360      5       NA
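
To make the long-format step concrete, here is a small sketch on two of the sample users and only three periods. The object names `small` and `long` are illustrative and not part of the original answer.

```r
library(data.table)

# Two users from the sample data, first three periods only (illustrative subset)
small <- data.table(userid  = c(29620L, 37531L),
                    Period1 = c(189L, 0L),
                    Period2 = c(32L, 6L),
                    Period3 = c(1L, 31L))

# melt() stacks the Period columns into a 'variable'/'value' pair,
# giving one row per user and period
long <- melt(small, id.vars = "userid")
long[order(userid)]
```

Each user then has one row per period, in period order, which is what lets `which(value > 0)` return period indices inside each `userid` group.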

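Explanation, for each id:

The active column is simply the index of the first non-zero value.
The inactive column is trickier. It is the smallest index of a duplicated value in the cumulative sum. We should drop the zero values from that cumulative sum to handle the case where the values start with zeros. Here is a simple example: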
```
 cumsum(c(1,0,1))  
[1] 1 1 2
      _    ## we want the index of this repeated 1
min(which(duplicated(cumsum(c(1,0,1)))))
[1] 2
```
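
One caveat: the cumulative-sum approach above gives the expected result on the sample data, but it can misfire on rows that are not in the sample. A row with leading zeros and trailing zeros, such as `c(0, 0, 5, 0, 0, 0)`, returns `NA` because dropping the leading zeros shifts the indices. A row with a gap before the final run of zeros, such as `c(1, 0, 1, 0, 0, 0)`, returns the index of the gap. Below is a minimal base R sketch of an alternative, assuming 'inactive' means the period right after the last non-zero value. The helper name `first_last_active` is hypothetical.

```r
# Hypothetical helper: x holds one user's counts in period order
first_last_active <- function(x) {
  nz <- which(x > 0)                      # periods with at least one comment
  if (length(nz) == 0L) {                 # user never commented
    return(c(active = NA_integer_, inactive = NA_integer_))
  }
  last <- max(nz)                         # last period with any comments
  c(active   = min(nz),
    inactive = if (last < length(x)) last + 1L else NA_integer_)
}

first_last_active(c(189, 32, 1, 0, 0, 0))    # active 1, inactive 4
first_last_active(c(0, 0, 5, 0, 0, 0))       # active 3, inactive 4
first_last_active(c(43, 40, 25, 0, 57, 37))  # active 1, inactive NA
```

On the original data frame this could be applied row-wise, for instance with `t(apply(dat[, -1], 1, first_last_active))`.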

How to create variables indicating the start and end periods of a time series in R