How do I compute a cumulative count over the last N days for each record in a 100k-row dataset?

Time: 2017-03-12 03:57:07

Tags: r csv

How to generate a conditional cumulative distinct count in R

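I am running into a problem in R. The dataset contains 9 categorical variables, so I have to derive numeric variables from them. The new variable I want to build is, for example, the number of distinct home phone numbers recorded for the same SSN within the 7 days up to a given record. Part of the code I wrote is below. It took 2 hours to process only 30% of the records in the 100k-row dataset. How can I optimize this code to make it more efficient? Thanks.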

Original variables: record, date, ssn, firstname, lastname, address, zip, dob, homephone

### Data sample

| record | date     | ssn       | homephone  | date1    | dscnt |
|--------|----------|-----------|------------|----------|-------|
|52     | 20150101  |737610282  |9105580920 |20150101 | 1   |
|91     | 20150101  |737610282  |3301371772 |20150101 | 2   |
|144    | 20150101  |737610282  |4139873682 |20150101 | 3   |
|236    | 20150101  |737610282  |4345558469 |20150101 | 4   |
|272    | 20150101  |737610282  |3979419673 |20150101 | 5   |
|282    | 20150102  |737610282  |8261221737 |20150102 | 6   |
|330    | 20150102  |737610282  |9105580920 |20150102 | 6   |
|412    | 20150102  |737610282  |7115050432 |20150102 | 7   |
|493    | 20150102  |737610282  |6872924382 |20150102 | 8   |
|571    | 20150103  |737610282  |2528003837 |20150103 | 9   |
|621    | 20150103  |737610282  |9105580920 |20150103 | 9   |
|763    | 20150103  |737610282  |581105006  |20150103 | 10  |
|854    | 20150104  |737610282  |9105580920 |20150104 | 10  |
|883    | 20150104  |737610282  |979778244  |20150104 | 11  |
|966    | 20150104  |737610282  |9952278232 |20150104 | 12  |
|974    | 20150104  |737610282  |3664924933 |20150104 | 13  |
|1057   | 20150104  |737610282  |9105580920 |20150104 | 13  |
|1091   | 20150104  |737610282  |32384275   |20150104 | 14  |
|1166   | 20150105  |737610282  |942052630  |20150105 | 15  |
|1304   | 20150105  |737610282  |9105580920 |20150105 | 15  |
|1416   | 20150105  |737610282  |1384461737 |20150105 | 16  |

Here is the code:

Names with the same SSN, address, and phone within the past 7 days
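    library(dplyr)
    library(lubridate)

    # df0 (the data) and DT (a table of dates, indexed by row) are not defined in this snippet
    e1 = NULL

    for (i in 1:365)
    {
      # keep every record dated on or before the date in row 365-i of DT
      a1 <- df0 %>% filter(date <= DT[365-i,])
      b1 <- 
        a1 %>%
        select(record,date,ssn,homephone) %>%
        # shift the dates so that the latest date in a1 lines up with today
        mutate(date1 = date + days(as.integer(today() - max(a1$date)))) %>%
        filter(date1 >= today() - days(7)) %>%
        group_by(ssn) %>%
        mutate(dscnt = n_distinct(homephone)) 

      # keep only the rows for the date being processed
      f1 = b1 %>% filter(date == DT[365-i,])

      # append this date's rows to the result
      e1 = rbind(e1,f1)
    }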
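For comparison, here is a minimal base-R sketch of one way the same count might be computed without re-filtering the whole table for every date. It is not the code from the question and has not been benchmarked on the 100k-row data. The helper name `rolling_distinct` and its arguments are hypothetical, and it assumes the `date` column has already been converted to `Date` (for example with `as.Date(as.character(date), format = "%Y%m%d")`). Like the loop above, it counts by day, so all rows for one SSN on the same date get the same value, whereas the `dscnt` column in the sample increases row by row. It also assumes the window is the current date plus the `n_days - 1` days before it.

```r
# Hypothetical helper: for every row, count the distinct `value`s seen for the
# same `key` within the `n_days` days ending on that row's date (inclusive).
rolling_distinct <- function(key, date, value, n_days = 7) {
  stopifnot(inherits(date, "Date"))
  day <- as.integer(date)            # days since 1970-01-01, plain integers
  out <- integer(length(key))
  # Handle one key (e.g. one SSN) at a time, so each lookup scans only that key's rows.
  groups <- split(seq_along(key), key)
  for (idx in groups) {
    d <- day[idx]
    v <- value[idx]
    ud <- unique(d)                  # compute once per distinct date in the group
    cnt <- vapply(ud, function(cur) {
      in_window <- d <= cur & d > cur - n_days
      length(unique(v[in_window]))
    }, integer(1))
    out[idx] <- cnt[match(d, ud)]    # map the per-date counts back to the rows
  }
  out
}

# A few rows taken from the data sample above.
sample_df <- data.frame(
  record    = c(52, 91, 282, 330, 571),
  date      = as.Date(as.character(c(20150101L, 20150101L, 20150102L,
                                     20150102L, 20150103L)), format = "%Y%m%d"),
  ssn       = 737610282,
  homephone = c(9105580920, 3301371772, 8261221737, 9105580920, 2528003837)
)
sample_df$dscnt <- rolling_distinct(sample_df$ssn, sample_df$date,
                                    sample_df$homephone, n_days = 7)
print(sample_df)
```

Splitting by SSN first means each window lookup touches only that SSN's rows, which is where the loop above seems to spend most of its time. If per-record counts like the sample's `dscnt` column are what is wanted, the window test would need to compare record order as well as date.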

0 answers:

No answers