Count by row, avoiding melt/aggregate

Time: 2017-01-27 11:19:00

Tags: r data.table dplyr

I am working with a data frame like this:

   idno      08:00      08:05      08:10    08:15    08:20    08:25
1     1   Domestic   Domestic   Domestic Domestic Domestic Domestic
2     2    Leisure    Leisure    Leisure  Leisure  Leisure  Leisure
3     3        Eat        Eat        Eat      Eat      Eat      Eat
4     4       Paid       Paid       Paid     Paid     Paid     Paid
5     5      Sleep      Sleep      Sleep    Sleep    Sleep    Sleep
6     6        Eat        Eat        Eat  Missing  Missing  Missing
7     7      Sleep      Sleep      Sleep    Sleep    Sleep    Sleep
8     8       Paid       Paid       Paid     Paid     Paid     Paid
9     9      Sleep      Sleep      Sleep    Sleep    Sleep    Sleep
10   10 Child Care Child Care Child Care   Travel   Travel   Travel

I would like to summarize it as per-row counts of each activity, like this.

Desired output:

       idno `Child Care` Domestic   Eat Leisure Missing  Paid Sleep Travel
*  <int>        <dbl>    <dbl> <dbl>   <dbl>   <dbl> <dbl> <dbl>  <dbl>
1      1            0        6     0       0       0     0     0      0
2      2            0        0     0       6       0     0     0      0
3      3            0        0     6       0       0     0     0      0
4      4            0        0     0       0       0     6     0      0
5      5            0        0     0       0       0     0     6      0
6      6            0        0     3       0       3     0     0      0
7      7            0        0     0       0       0     0     6      0
8      8            0        0     0       0       0     6     0      0
9      9            0        0     0       0       0     0     6      0
10    10            3        0     0       0       0     0     0      3

I usually just do this:

melt(df, id.vars = 'idno') %>% count(idno, value) %>% spread(value, n, 0)

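However, I am wondering whether there is a more straightforward way to do it. My problem is that I am working with a very large dataset, and using melt, then count, then spread can be rather slow.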

Is there a direct way to count the values across the columns of each row (the distribution of the variables), preferably using data.table?

setDT(df)[, .N, by = ] # by columns, for each row?

df = structure(list(idno = 1:10, `08:00` = c("Domestic", "Leisure", 
"Eat", "Paid", "Sleep", "Eat", "Sleep", "Paid", "Sleep", "Child Care"
), `08:05` = c("Domestic", "Leisure", "Eat", "Paid", "Sleep", 
"Eat", "Sleep", "Paid", "Sleep", "Child Care"), `08:10` = c("Domestic", 
"Leisure", "Eat", "Paid", "Sleep", "Eat", "Sleep", "Paid", "Sleep", 
"Child Care"), `08:15` = c("Domestic", "Leisure", "Eat", "Paid", 
"Sleep", "Missing", "Sleep", "Paid", "Sleep", "Travel"), `08:20` =    c("Domestic", 
"Leisure", "Eat", "Paid", "Sleep", "Missing", "Sleep", "Paid", 
"Sleep", "Travel"), `08:25` = c("Domestic", "Leisure", "Eat", 
"Paid", "Sleep", "Missing", "Sleep", "Paid", "Sleep", "Travel"
)), .Names = c("idno", "08:00", "08:05", "08:10", "08:15", "08:20", 
"08:25"), row.names = c(NA, 10L), class = "data.frame")

1 Answer:

Answer 0 (score: 4)

You can try mtabulate from qdapTools:

library(qdapTools)

mtabulate(split(df[-1], seq(nrow(df))))

#   Child Care Domestic Eat Leisure Missing Paid Sleep Travel
#1           0        6   0       0       0    0     0      0
#2           0        0   0       6       0    0     0      0
#3           0        0   6       0       0    0     0      0
#4           0        0   0       0       0    6     0      0
#5           0        0   0       0       0    0     6      0
#6           0        0   3       0       3    0     0      0
#7           0        0   0       0       0    0     6      0
#8           0        0   0       0       0    6     0      0
#9           0        0   0       0       0    0     6      0
#10          3        0   0       0       0    0     0      3
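
If installing qdapTools is not an option, the same per-row counts can probably be obtained with base R alone. The following is only a sketch, not part of the answer above: it flattens the activity columns into one vector, pairs each value with its idno, and cross-tabulates with table(). The names first_half, second_half, vals, ids, counts and res are made up for illustration, and the data frame is rebuilt by hand from the question's dput output. It has not been benchmarked against the melt/count/spread pipeline, so any speed gain on a large dataset is an assumption to be checked.

```r
# Rebuild the example data from the question (same values as the dput output).
first_half  <- c("Domestic", "Leisure", "Eat", "Paid", "Sleep",
                 "Eat", "Sleep", "Paid", "Sleep", "Child Care")
second_half <- c("Domestic", "Leisure", "Eat", "Paid", "Sleep",
                 "Missing", "Sleep", "Paid", "Sleep", "Travel")
df <- data.frame(idno = 1:10,
                 `08:00` = first_half,  `08:05` = first_half,
                 `08:10` = first_half,  `08:15` = second_half,
                 `08:20` = second_half, `08:25` = second_half,
                 check.names = FALSE, stringsAsFactors = FALSE)

# Flatten all time-slot columns into a single vector (column by column).
vals <- unlist(df[-1], use.names = FALSE)

# Repeat idno once per time-slot column so it lines up with vals.
ids <- rep(df$idno, times = ncol(df) - 1)

# Cross-tabulate: one row per idno, one column per activity.
counts <- table(idno = ids, activity = vals)

# Convert the 2-d table to a wide data frame and put idno back as a column.
res <- cbind(idno = as.integer(rownames(counts)),
             as.data.frame.matrix(counts))
print(res)
```

Because every row of df contributes exactly one value per time slot, each row of res should sum to the number of time-slot columns (6 here), which is a cheap sanity check on larger data.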