R data.table: label runs of consecutive non-NA values with their count

Asked: 2017-05-08 18:45:45

Tags: r label data.table

Here is my data.table, with a single column column1:

library(data.table)

dt = data.table(column1 = c(NA, NA, "A", "A", "A", NA, NA, NA, NA, "B", NA, NA, "1 2", "1 2", NA, NA, "A", "A", "A", "A", "A", NA, NA, NA, NA, ...))

> print(dt)
    column1
 1:      NA
 2:      NA
 3:       A
 4:       A
 5:       A
 6:      NA
 7:      NA
 8:      NA
 9:      NA
10:       B
11:      NA
12:      NA
13:     1 2
14:     1 2
15:      NA
16:      NA
17:       A
18:       A
19:       A
20:       A
21:       A
22:      NA
23:      NA
24:      NA
25:      NA
...     ...

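The values in column1 are either NA or character strings. I want to label each run of consecutive non-NA values with the number of items in that run. This is the desired dt$labels column:
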
> print(dt)
    column1    labels
 1:      NA    0 
 2:      NA    0 
 3:       A    3 
 4:       A    3 
 5:       A    3    
 6:      NA    0 
 7:      NA    0  
 8:      NA    0
 9:      NA    0 
10:       B    1 
11:      NA    0 
12:      NA    0  
13:     1 2    2  
14:     1 2    2  
15:      NA    0 
16:      NA    0 
17:       A    5  
18:       A    5  
19:       A    5  
20:       A    5     
21:       A    5    
22:      NA    0  
23:      NA    0 
24:      NA    0   
25:      NA    0   
...     ...    ...   

There are 3 consecutive "A"s, 1 "B", 2 "1 2"s, and 5 "A"s.

Using rle():

x <- rle(dt$column1) 

gives the length of each run (note that rle() treats every NA as its own run of length 1):

Run Length Encoding
  lengths: int [1:18] 1 1 3 1 1 1 1 1 1 1 ...
  values : chr [1:18] NA NA "A" NA NA NA NA "B" NA NA "1 2" ...

But I am not sure how to map these lengths onto the data.table column labels.

2 answers:

Answer 0 (score: 6)

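We can use rleid from data.table to create a grouping variable over the NA mask, then multiply the logical vector !is.na(column1) by .N (the group size) and assign (:=) the result to 'labels'. Rows in NA runs are multiplied by FALSE and so get 0.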

dt[, labels := .N*!is.na(column1), rleid(is.na(column1))]
dt
#    column1 labels
# 1:      NA      0
# 2:      NA      0
# 3:       A      3
# 4:       A      3
# 5:       A      3
# 6:      NA      0
# 7:      NA      0
# 8:      NA      0
# 9:      NA      0
#10:       B      1
#11:      NA      0
#12:      NA      0
#13:     1 2      2
#14:     1 2      2
#15:      NA      0
#16:      NA      0
#17:       A      5
#18:       A      5
#19:       A      5
#20:       A      5
#21:       A      5
#22:      NA      0
#23:      NA      0
#24:      NA      0
#25:      NA      0
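
The one-liner above can be unpacked into its three steps. The following is only an illustrative sketch on a shortened vector; the helper columns grp and n are hypothetical names introduced here and are not part of the original answer.

```r
library(data.table)

# Shortened toy data (an assumption for illustration only)
toy <- data.table(column1 = c(NA, NA, "A", "A", NA, "B"))

# Step 1: one id per run of NA / non-NA rows
toy[, grp := rleid(is.na(column1))]   # 1 1 2 2 3 4

# Step 2: size of each run
toy[, n := .N, by = grp]              # 2 2 2 2 1 1

# Step 3: keep the size for non-NA rows, 0 for NA rows
toy[, labels := n * !is.na(column1)]  # 0 0 2 2 0 1
```

Grouping on is.na(column1) rather than on column1 itself is what makes each stretch of non-NA rows count as a single run, whatever values it contains.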

Data

dt <- data.table(column1 = c(NA, NA, "A", "A", "A", NA, NA, NA, NA, "B", 
  NA, NA, "1 2", "1 2", NA, NA, "A", "A", "A", "A", "A", NA, NA, NA, NA))
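
Since the question started from rle(), here is a sketch of how the same labels might be obtained with base R's rle(). It is an alternative to the rleid approach, not part of the original answer; the variable name r is arbitrary.

```r
library(data.table)

dt <- data.table(column1 = c(NA, NA, "A", "A", "A", NA, NA, NA, NA, "B",
  NA, NA, "1 2", "1 2", NA, NA, "A", "A", "A", "A", "A", NA, NA, NA, NA))

# Encode the NA mask instead of the values, so that consecutive NAs
# collapse into one run (rle() on the raw values splits every NA).
r <- rle(is.na(dt$column1))

# Repeat each run length across its own rows, then zero out the NA rows.
dt[, labels := rep(r$lengths, r$lengths) * !is.na(column1)]

dt$labels
# 0 0 3 3 3 0 0 0 0 1 0 0 2 2 0 0 5 5 5 5 5 0 0 0 0
```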

Answer 1 (score: 0)

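@Akrun's answer works well when each run of non-NA values between NAs contains a single repeated value. It does not distinguish different values inside one run. For example (note that the only difference is that I changed the two NAs just before the first "B" to "A"):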

dt <- data.table(column1 = c(NA, NA, "A", "A", "A", NA, NA, "A", "A", "B", 
                         NA, NA, "1 2", "1 2", NA, NA, "A", "A", "A", "A", "A", NA, NA, NA, NA))
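
To see the effect, here is a small sketch that applies the one-liner from the first answer to this modified data. The object name demo is a hypothetical name, used so that dt itself is left untouched.

```r
library(data.table)

demo <- data.table(column1 = c(NA, NA, "A", "A", "A", NA, NA, "A", "A", "B",
                               NA, NA, "1 2", "1 2", NA, NA, "A", "A", "A", "A", "A", NA, NA, NA, NA))

# Grouping on the NA mask treats rows 8 to 10 ("A", "A", "B") as one run
demo[, labels := .N * !is.na(column1), by = rleid(is.na(column1))]

demo$labels[8:10]
# 3 3 3
```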

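To make sure the first sets of consecutive "A"s all end up in the same group, separate from the "B" that follows, the code below (slightly modified) will work. Note that labels then holds a run identifier rather than a count: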

dt[!is.na(column1), labels:=rleid(column1), rleid(is.na(column1))]

The output is as follows:

        column1 labels
 1:    <NA>     NA
 2:    <NA>     NA
 3:       A      1
 4:       A      1
 5:       A      1
 6:    <NA>     NA
 7:    <NA>     NA
 8:       A      1
 9:       A      1
10:       B      2
11:    <NA>     NA
12:    <NA>     NA
13:     1 2      3
14:     1 2      3
15:    <NA>     NA
16:    <NA>     NA
17:       A      4
18:       A      4
19:       A      4
20:       A      4
21:       A      4
22:    <NA>     NA
23:    <NA>     NA
24:    <NA>     NA
25:    <NA>     NA

To replace the NAs with zeros: dt[is.na(labels), labels := 0L]
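
If the goal is still the count from the question, but with runs split wherever the value changes, one possible variation is to group on rleid(column1) directly. This is a sketch that combines the two answers and is not taken from either of them; the object name counts is hypothetical.

```r
library(data.table)

counts <- data.table(column1 = c(NA, NA, "A", "A", "A", NA, NA, "A", "A", "B",
                                 NA, NA, "1 2", "1 2", NA, NA, "A", "A", "A", "A", "A", NA, NA, NA, NA))

# One group per run of identical values; NA rows are multiplied by FALSE
# and therefore get 0 however the NA runs happen to be grouped.
counts[, labels := .N * !is.na(column1), by = rleid(column1)]

counts$labels
# 0 0 3 3 3 0 0 2 2 1 0 0 2 2 0 0 5 5 5 5 5 0 0 0 0
```

Here rows 8 and 9 ("A", "A") are labeled 2 and row 10 ("B") is labeled 1, instead of all three sharing the label 3.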