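Error when consolidating similar rows with plyr - what am I doing wrong?

Time: 2016-08-01 17:26:45

Tags: r dplyr plyr consolidation

I have a data frame (dtetags.df) whose Date column contains many repeated dates: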

dtetags.df$Date
 "2016-07-22" "2016-07-22" "2016-07-21" "2016-07-21" "2016-07-20" "2016-07-20" "2016-07-19" "2016-07-19" "2016-07-18" "2016-07-18" "2016-07-15" "2016-07-15" "2016-07-15" "2016-07-14"
 "2016-07-14" "2016-07-13" "2016-07-13" "2016-07-13" "2016-07-12" "2016-07-12" "2016-07-12" "2016-07-12" "2016-07-11" "2016-07-11" "2016-07-11" "2016-07-11" "2016-07-08" "2016-07-08"
 "2016-07-08" "2016-07-07" "2016-07-07" "2016-07-07" "2016-07-07" "2016-07-06" "2016-07-06" "2016-07-05" "2016-07-05" "2016-07-05" "2016-07-05" "2016-07-01" "2016-07-01" "2016-06-30"
 "2016-06-30" "2016-06-29" "2016-06-29" "2016-06-29" "2016-06-29" "2016-06-29" "2016-06-28" "2016-06-28" "2016-06-28" "2016-06-27" "2016-06-27" "2016-06-27" "2016-06-24" "2016-06-24"
 "2016-06-23" "2016-06-23" "2016-06-22" "2016-06-22" "2016-06-21" "2016-06-21" "2016-06-20" "2016-06-20" "2016-06-17" "2016-06-17" "2016-06-16" "2016-06-16" "2016-06-15" "2016-06-15"
 "2016-06-14" "2016-06-13" "2016-06-13" "2016-06-10" "2016-06-10" "2016-06-09" "2016-06-09" "2016-06-09" "2016-06-09" "2016-06-08" "2016-06-08" "2016-06-07" "2016-06-07" "2016-06-06"
 "2016-06-06" "2016-06-06" "2016-06-01" "2016-06-01" "2016-05-29" "2016-05-29" "2016-05-27" "2016-05-27" "2016-05-26" "2016-05-26" "2016-05-25" "2016-05-25" "2016-05-24" "2016-05-23"
 "2016-05-23" "2016-05-20"

It also has a number of binary tag columns that indicate whether a post with that tag was created on that date, for example:

dtetags.df$Technology
 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "1" "0" "0" "0" "0" "1" "1" "0" "1" "0" "1"
 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "1" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "1" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"

Based on this question, I am trying to use ddply(dtetags.df, "Date", numcolwise(sum)), but it returns the message <0 rows> (or 0-length row.names). I have tried many different ways of writing the ddply call, but I cannot get it to work.
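A quick diagnostic sketch, assuming the tag columns are stored as character strings (the quoted "0" and "1" values printed above suggest this). plyr's numcolwise only applies its function to numeric columns, so if no column is numeric there is nothing left to sum. The small data frame toy below is made up for illustration and is not the real data.

```r
# Hypothetical miniature of dtetags.df with the tag column stored as text
toy <- data.frame(
  Date = c("2016-07-01", "2016-07-01", "2016-06-30"),
  Technology = c("1", "1", "0"),
  stringsAsFactors = FALSE
)

# Inspect the class of every column; "character" or "factor" for a tag
# column means it will be skipped by a numeric-only summary
col_classes <- sapply(toy, class)
print(col_classes)

# Count how many columns are numeric
n_numeric <- sum(sapply(toy, is.numeric))
print(n_numeric)
```

If the count is zero, converting the tag columns to numeric before grouping is the likely fix.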

The ideal output would be:

               Date            Technology
1        2016-07-22                     0
2        2016-07-21                     0
3        2016-07-20                     0
4        2016-07-19                     0
5        2016-07-18                     0
6        2016-07-15                     0
7        2016-07-14                     0
8        2016-07-13                     0
9        2016-07-12                     0
10       2016-07-11                     0
11       2016-07-08                     0
12       2016-07-07                     0
13       2016-07-06                     1
14       2016-07-05                     0
15       2016-07-01                     2
16       2016-06-30                     1
17       2016-06-29                     1
18       2016-06-28                     0
19       2016-06-27                     0
20       2016-06-24                     1
21       2016-06-23                     0
22       2016-06-22                     0
23       2016-06-21                     0
24       2016-06-20                     0
25       2016-06-17                     0
26       2016-06-16                     0
27       2016-06-15                     0
28       2016-06-14                     1
29       2016-06-13                     0
30       2016-06-10                     0
31       2016-06-09                     0
32       2016-06-08                     0
33       2016-06-07                     0
34       2016-06-06                     0
35       2016-06-01                     0
36       2016-05-29                     0
37       2016-05-27                     0
38       2016-05-26                     0
39       2016-05-25                     0
40       2016-05-24                     0
41       2016-05-23                     0
42       2016-05-20                     0

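Is there something obvious I am doing wrong?

Converting from factor to numeric

I removed the Date column, applied data.frame(apply(dtetags.df, 2, function(x) as.numeric(as.character(x)))) to the rest of the data frame, and then added the Date column back.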
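The conversion described above could be sketched roughly as follows. This is a variation, not the exact code used: it swaps apply (which coerces the data frame to a matrix first) for lapply over the non-Date columns, so Date never has to be removed and re-attached. The data frame toy and the name tag_cols are hypothetical.

```r
# Hypothetical miniature of dtetags.df with a factor tag column
toy <- data.frame(
  Date = c("2016-07-01", "2016-07-01", "2016-06-30"),
  Technology = factor(c("1", "1", "0")),
  stringsAsFactors = FALSE
)

# Every column except Date
tag_cols <- setdiff(names(toy), "Date")

# as.character() first, so a factor yields its labels rather than
# its internal integer codes
toy[tag_cols] <- lapply(toy[tag_cols], function(x) as.numeric(as.character(x)))

str(toy)
```

Going through as.character matters for factors: calling as.numeric directly on a factor returns the level codes, not the values shown on screen.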

dput(dtetags.df)
structure(list(Date = c("2016-07-22", "2016-07-22", "2016-07-21", 
"2016-07-21", "2016-07-20", "2016-07-20", "2016-07-19", "2016-07-19", 
"2016-07-18", "2016-07-18", "2016-07-15", "2016-07-15", "2016-07-15", 
"2016-07-14", "2016-07-14", "2016-07-13", "2016-07-13", "2016-07-13", 
"2016-07-12", "2016-07-12", "2016-07-12", "2016-07-12", "2016-07-11", 
"2016-07-11", "2016-07-11", "2016-07-11", "2016-07-08", "2016-07-08", 
"2016-07-08", "2016-07-07", "2016-07-07", "2016-07-07", "2016-07-07", 
"2016-07-06", "2016-07-06", "2016-07-05", "2016-07-05", "2016-07-05", 
"2016-07-05", "2016-07-01", "2016-07-01", "2016-06-30", "2016-06-30", 
"2016-06-29", "2016-06-29", "2016-06-29", "2016-06-29", "2016-06-29", 
"2016-06-28", "2016-06-28", "2016-06-28", "2016-06-27", "2016-06-27", 
"2016-06-27", "2016-06-24", "2016-06-24", "2016-06-23", "2016-06-23", 
"2016-06-22", "2016-06-22", "2016-06-21", "2016-06-21", "2016-06-20", 
"2016-06-20", "2016-06-17", "2016-06-17", "2016-06-16", "2016-06-16", 
"2016-06-15", "2016-06-15", "2016-06-14", "2016-06-13", "2016-06-13", 
"2016-06-10", "2016-06-10", "2016-06-09", "2016-06-09", "2016-06-09", 
"2016-06-09", "2016-06-08", "2016-06-08", "2016-06-07", "2016-06-07", 
"2016-06-06", "2016-06-06", "2016-06-06", "2016-06-01", "2016-06-01", 
"2016-05-29", "2016-05-29", "2016-05-27", "2016-05-27", "2016-05-26", 
"2016-05-26", "2016-05-25", "2016-05-25", "2016-05-24", "2016-05-23", 
"2016-05-23", "2016-05-20"), `Technology` = c(0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("Date", 
"Technology"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -100L))

1 Answer:

Answer 0: (score: 0)

To accomplish what you want, you can use the dplyr package:

library(dplyr)
out <- dtetags.df %>% group_by(Date) %>% summarise_each(funs(sum)) %>% arrange(desc(Date))

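Notes:

  1. group_by Date, so that the subsequent operations are applied to groups of rows that share the same date.
  2. Summarise every column (except Date) using the sum function.
  3. Use arrange to sort the result by date in descending order.
  4. Given the input data, the output matches what was expected: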

    print(out)
    # A tibble: 42 x 2
         Date     Technology
        <chr>          <dbl>
    1  2016-07-22          0
    2  2016-07-21          0
    3  2016-07-20          0
    4  2016-07-19          0
    5  2016-07-18          0
    6  2016-07-15          0
    7  2016-07-14          0
    8  2016-07-13          0
    9  2016-07-12          0
    10 2016-07-11          0
    11 2016-07-08          0
    12 2016-07-07          0
    13 2016-07-06          1
    14 2016-07-05          0
    15 2016-07-01          2
    16 2016-06-30          1
    17 2016-06-29          1
    18 2016-06-28          0
    19 2016-06-27          0
    20 2016-06-24          1
    21 2016-06-23          0
    22 2016-06-22          0
    23 2016-06-21          0
    24 2016-06-20          0
    25 2016-06-17          0
    26 2016-06-16          0
    27 2016-06-15          0
    28 2016-06-14          1
    29 2016-06-13          0
    30 2016-06-10          0
    31 2016-06-09          0
    32 2016-06-08          0
    33 2016-06-07          0
    34 2016-06-06          0
    35 2016-06-01          0
    36 2016-05-29          0
    37 2016-05-27          0
    38 2016-05-26          0
    39 2016-05-25          0
    40 2016-05-24          0
    41 2016-05-23          0
    42 2016-05-20          0
    

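Caveat: this requires every column of dtetags.df other than Date to be numeric. If they are not, convert them before applying this code. This can be done using the answer found here.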
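For readers on a recent dplyr, here is a hedged sketch of the same pipeline with the conversion step folded in. summarise_each() and funs() are deprecated in current dplyr releases in favour of across(), which requires dplyr 1.0.0 or later. The data frame toy and the result name out_toy are made up for illustration.

```r
library(dplyr)

# Hypothetical miniature of dtetags.df with the tag column stored as text
toy <- data.frame(
  Date = c("2016-07-01", "2016-07-01", "2016-06-30"),
  Technology = c("1", "1", "0"),
  stringsAsFactors = FALSE
)

out_toy <- toy %>%
  # Convert every column except Date to numeric
  mutate(across(-Date, ~ as.numeric(as.character(.x)))) %>%
  group_by(Date) %>%
  # across() skips the grouping column, so only the tag columns are summed
  summarise(across(everything(), sum)) %>%
  arrange(desc(Date))

print(out_toy)
```

On the toy data this gives one row per date, with the two 2016-07-01 rows collapsed into a single row whose Technology value is 2.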

Hope this helps.