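How to cumulatively track unique combinations by group in R

Date: 2018-06-11 18:44:14

Tags: r count data.table unique

I have observations of unique identifiers, each associated with one or more START/END date pairs. The observations are expanded by month, per ID, across each date range. Here is an example for one unique ID and one category, truncated for length: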

  ID       START        END    MONTH CAT.A
10056 2004-01-08 2005-01-07 Jan 2004 
10056 2004-01-08 2005-01-07 Feb 2004 
10056 2004-01-08 2005-01-07 Mar 2004 
...
10056 2004-01-08 2005-01-07 Nov 2004 
10056 2004-01-08 2005-01-07 Dec 2004 
10056 2004-01-08 2005-01-07 Jan 2005 
--------------------------------------
10056 2006-11-28 2008-02-20 Nov 2006 
10056 2006-11-28 2008-02-20 Dec 2006 
10056 2006-11-28 2008-02-20 Jan 2007 
...
10056 2006-11-28 2008-02-20 Dec 2007 
10056 2006-11-28 2008-02-20 Jan 2008 
10056 2006-11-28 2008-02-20 Feb 2008 
--------------------------------------
10056 2010-01-30 2011-02-03 Jan 2010 
10056 2010-01-30 2011-02-03 Feb 2010 
10056 2010-01-30 2011-02-03 Mar 2010 
...
10056 2010-01-30 2011-02-03 Dec 2010 
10056 2010-01-30 2011-02-03 Jan 2011 
10056 2010-01-30 2011-02-03 Feb 2011 

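The solution I'm looking for would cumulatively count each unique event in CAT.A: CAT.A is 1 in the first date range, increases to 2 in the second date range, and to 3 in the third. The counter is specific to this ID; otherwise it is NA.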

  ID       START        END    MONTH CAT.A
10056 2004-01-08 2005-01-07 Jan 2004 1
10056 2004-01-08 2005-01-07 Feb 2004 1
10056 2004-01-08 2005-01-07 Mar 2004 1
...
10056 2004-01-08 2005-01-07 Nov 2004 1
10056 2004-01-08 2005-01-07 Dec 2004 1
10056 2004-01-08 2005-01-07 Jan 2005 1
--------------------------------------
10056 2006-11-28 2008-02-20 Nov 2006 2
10056 2006-11-28 2008-02-20 Dec 2006 2
10056 2006-11-28 2008-02-20 Jan 2007 2
...
10056 2006-11-28 2008-02-20 Dec 2007 2
10056 2006-11-28 2008-02-20 Jan 2008 2
10056 2006-11-28 2008-02-20 Feb 2008 2
--------------------------------------
10056 2010-01-30 2011-02-03 Jan 2010 3
10056 2010-01-30 2011-02-03 Feb 2010 3
10056 2010-01-30 2011-02-03 Mar 2010 3
...
10056 2010-01-30 2011-02-03 Dec 2010 3
10056 2010-01-30 2011-02-03 Jan 2011 3
10056 2010-01-30 2011-02-03 Feb 2011 3

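The dataset has millions of other unique IDs and 11 other categories, but if I can find a solution for this subset, I should be able to apply it to the whole dataset.

The solutions I have found let me count the total number of unique ID/START/END combinations, but none of them helps with incrementing CAT.A on each observation only when that observation belongs to a new unique START/END event.

I have been using data.table and lubridate.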

1 answer:

Answer 0: (score: 0)

How about this?


If you want a separate column for each ID value (which seems odd, given that you have millions of unique values), you could use something like this:

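library(data.table)

# Toy data: two IDs, each with a sequence of category values
d <- data.table(
    ID = c(rep(1, 5), rep(2, 5)),
    CAT = c(1, 1, 1, 2, 2, 1, 1, 2, 3, 4)
)

# Within each ID, add 1 the first time a CAT value appears
d[, N_Unique := cumsum(!duplicated(CAT)), by = ID]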

> d
    ID CAT N_Unique
 1:  1   1        1
 2:  1   1        1
 3:  1   1        1
 4:  1   2        2
 5:  1   2        2
 6:  2   1        1
 7:  2   1        1
 8:  2   2        2
 9:  2   3        3
10:  2   4        4
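
Applied to the data in the question, the same idea could be keyed on the START/END pair instead of a single category column. The following is only a minimal sketch under some assumptions: the table `dt` is a hand-built subset of the question's rows (two monthly rows per date range), the second ID `20001` is invented purely to show the counter restarting per ID, and pasting START and END into one string is just one possible way to identify a unique event.

```r
library(data.table)

# Hand-built subset of the question's data: two monthly rows per START/END range.
# ID 20001 is hypothetical, added only to show the counter restarting for a new ID.
dt <- data.table(
  ID    = c(rep(10056, 6), rep(20001, 2)),
  START = as.Date(c("2004-01-08", "2004-01-08", "2006-11-28", "2006-11-28",
                    "2010-01-30", "2010-01-30", "2005-03-01", "2005-03-01")),
  END   = as.Date(c("2005-01-07", "2005-01-07", "2008-02-20", "2008-02-20",
                    "2011-02-03", "2011-02-03", "2005-06-30", "2005-06-30"))
)

# Order rows so that events are numbered chronologically within each ID
setorder(dt, ID, START, END)

# Within each ID, add 1 the first time a START/END pair is seen
dt[, CAT.A := cumsum(!duplicated(paste(START, END))), by = ID]

print(dt)
```

With this sample, CAT.A comes out as 1, 1, 2, 2, 3, 3 for ID 10056 and starts again at 1 for the second ID. If each START/END pair is always contiguous within an ID, `rleid(START, END)` from data.table should give the same numbering without building a string key.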