Computation failed in stat_alluvium(): Each row of output must be identified by a unique combination of keys

Time: 2019-11-20 19:46:36

Tags: r ggplot2 dplyr tidyverse

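I have a data.frame built with the tidyverse toolset, mostly dplyr verbs chained with pipes. Judging by the geom_flow() examples I could find for ggalluvial, the data appears to be formatted correctly. After being imported from an MSSQL database, my dataset goes through a number of transformations and has about 300k rows. So I created a dummy version that reports all the same classes and formats at the point where I start setting it up for ggalluvial. I also did this to see whether the error could be reproduced at a smaller scale for easier troubleshooting.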

data <- data.frame(Employee = as.numeric(c(1450,1450,1450,1450,1460,1460,1460,1460,1470,1470)),
                  PostDate = as.POSIXct(c("2019-08-15","2019-09-12","2019-09-15","2019-10-12","2019-08-15","2019-09-12","2019-09-15","2019-10-12","2019-08-15","2019-09-12")),
                  Job = as.character(c("1901", "1901","1902","1902","1901", "1901","1902","1902","1901", "1901")),
                  Phase = as.character(c("950-", "950-", "950-", "950-", "950-", "950-", "950-", "950-", "950-", "950-")),
                  Craft = as.character(c("Lab", "Lab", "Lab", "Lab", "Lab", "Lab", "Lab", "Lab", "Lab", "Lab")),
                  Class = as.character(c("1B", "1B", "1B", "1B", "1B", "1B", "1B", "1B", "1B", "1B")),
                  EarnCode  = as.numeric(c("51", "51", "51", "51", "51", "51", "51", "51", "51", "51")),
                  Hours = as.numeric(c(8, 8, 7, 6, 5, 4, 12, 3, 8, 9)),
                  Rate = as.numeric(c(50, 50, 50, 50, 50, 50, 50, 50, 50, 50)),
                  Amt = as.numeric(c(100, 100, 100, 100, 100, 100, 100, 100, 100, 100)),
                  LastName = as.character(c("bill", "bill", "bill", "bill", "mike", "mike", "mike", "mike", "joe", "joe")),
                  FirstName = as.character(c("bill", "bill", "bill", "bill", "mike", "mike", "mike", "mike", "joe", "joe")), stringsAsFactors=FALSE)

Most of the columns are dropped early on, but to reduce the data to what I want I apply the following tidyverse steps:

library(dplyr)      # filter, select, group_by, distinct, summarize
library(lubridate)  # floor_date

# One row per Job / month / Employee combination
df_data_Aluv <- data %>%
        filter(PostDate >= "2019-08-01" & PostDate <= "2019-10-30") %>%
        select(date = PostDate, Employee, Job, Hours) %>%
        group_by(Job, month = as.character(floor_date(date, "month")), Employee) %>%
        distinct(month, Job, Employee, .keep_all = TRUE) %>%
        summarize(freq = n_distinct(Employee))

The idea is to summarize the employee count by month and show it as a Sankey diagram by month. The plotting code I expect to use is:

library(ggplot2)
library(ggalluvial)  # geom_flow, geom_stratum, stat_alluvium

ggplot(df_data_Aluv,
           aes(x = month, 
               stratum = Job, 
               alluvium = Employee,
               y = freq,
               fill = Job, 
               label = Job)) +
        scale_x_discrete(expand = c(.1, .1)) +
        geom_flow(stat = "alluvium", 
                  lode.guidance = "frontback",
                  color = "darkgray") +
        geom_stratum(alpha = .5) +
        geom_text(stat = "stratum", size = 3) +
        theme(legend.position = "bottom") +
        ggtitle("Project month responses at three points in time")

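Pre-edit edit: Well, while typing this up I realized that Job was creating a condition in which an employee would be counted in two places in each bar of the chart. I figured that might be part of the problem, so I reworked my test data and tested the change with the following data: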

        data <- data.frame(Employee = as.numeric(c(1450,1450,1450,1450,1460,1460,1460,1460,1470,1470)),
                       PostDate = as.POSIXct(c("2019-08-15","2019-08-12","2019-09-15","2019-10-12","2019-08-15","2019-08-12","2019-09-15","2019-10-12","2019-08-15","2019-09-12")),
                       Job = as.character(c("1901", "1901","1902","1902","1901", "1901","1902","1902","1901", "1901")),
                       Phase = as.character(c("950-", "950-", "950-", "950-", "950-", "950-", "950-", "950-", "950-", "950-")),
                       Craft = as.character(c("Lab", "Lab", "Lab", "Lab", "Lab", "Lab", "Lab", "Lab", "Lab", "Lab")),
                       Class = as.character(c("1B", "1B", "1B", "1B", "1B", "1B", "1B", "1B", "1B", "1B")),
                       EarnCode  = as.numeric(c("51", "51", "51", "51", "51", "51", "51", "51", "51", "51")),
                       Hours = as.numeric(c(8, 8, 7, 6, 5, 4, 12, 3, 8, 9)),
                       Rate = as.numeric(c(50, 50, 50, 50, 50, 50, 50, 50, 50, 50)),
                       Amt = as.numeric(c(100, 100, 100, 100, 100, 100, 100, 100, 100, 100)),
                       LastName = as.character(c("bill", "bill", "bill", "bill", "mike", "mike", "mike", "mike", "joe", "joe")),
                       FirstName = as.character(c("bill", "bill", "bill", "bill", "mike", "mike", "mike", "mike", "joe", "joe")), stringsAsFactors=FALSE)

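I was able to rework my dplyr steps so that either dataset would succeed. The new steps effectively choose which job an employee should be counted under in a given month by filtering on the maximum hours. After this change, both sample datasets work with the following code: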

    df_data_Aluv <- data %>%
        filter(PostDate >= "2019-08-01" & PostDate <= "2019-10-30") %>%
        select(date = PostDate, Employee, Job, Hours) %>%
        group_by(Job, month = as.character(floor_date(date, "month")), Employee) %>%
        summarize(freq = n_distinct(Employee), Hours = sum(Hours)) %>%
        group_by(Employee, month) %>%
        filter(Hours == max(Hours))
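
One thing worth checking at this point: `filter(Hours == max(Hours))` keeps every row that ties for the maximum, so an employee with equal hours on two jobs in one month would still appear twice. The sketch below uses made-up toy data (the `toy`, `kept` and `dupes` names are illustrative only) and counts rows per Employee/month pair. Whether such ties occur in the real data is an assumption that would need to be verified.

```r
library(dplyr)

# Hypothetical toy data: employee 1450 logs the same total hours on two jobs
# in one month, so the per-month maximum is tied.
toy <- data.frame(
  Employee = c(1450, 1450, 1460),
  month    = c("2019-08-01", "2019-08-01", "2019-08-01"),
  Job      = c("1901", "1902", "1901"),
  Hours    = c(16, 16, 9),
  stringsAsFactors = FALSE
)

kept <- toy %>%
  group_by(Employee, month) %>%
  filter(Hours == max(Hours)) %>%  # keeps all rows that tie for the maximum
  ungroup()

# Count rows per (Employee, month); any n > 1 is a repeated alluvium/x pair.
dupes <- kept %>%
  count(Employee, month) %>%
  filter(n > 1)

print(dupes)
```

If `dupes` has any rows for the real data, those Employee/month pairs are candidates for the shared keys reported by the error.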

So I went back to my original 300k-row dataset and applied the dplyr steps to it, which gave me a reduced data frame of 287 rows and the error message:

    Each row of output must be identified by a unique combination of keys.
    Keys are shared for 287 rows:
    * 1, 2
    * 3, 4
    * 5, 6
    ... (lists every row this way)

Now I have it reduced to 287 rows, but the error is now:

    Each row of output must be identified by a unique combination of keys.
    Keys are shared for 1 rows:
    * 109, 110

Looking at these two rows in RStudio's View(), I can't see why it would still flag them as sharing keys.


    106  1906-  2019-10-01  4267  1   91.5
    107  1906-  2019-10-01  4317  1  119.0
    108  1907-  2019-08-01   582  1  406.0
    109  1907-  2019-08-01   705  1  396.0
    110  1907-  2019-08-01  1224  1  229.5
    111  1907-  2019-08-01  1700  1  179.5
    112  1907-  2019-08-01  1744  1  235.0
    113  1907-  2019-08-01  1959  1  234.5
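
The row numbers in the message may refer to the data as the stat sees it internally rather than to the rows shown by View(). That is an assumption, not something confirmed here. A check that avoids relying on those numbers is to look directly for repeated Employee/month combinations. Below is a minimal base R sketch; `shared_keys()` is a hypothetical helper and the `reduced` rows are made up for illustration.

```r
# Hypothetical helper: return the rows of `df` whose combination of the
# columns named in `keys` occurs more than once.
shared_keys <- function(df, keys) {
  k <- df[, keys, drop = FALSE]
  dup <- duplicated(k) | duplicated(k, fromLast = TRUE)
  df[dup, , drop = FALSE]
}

# Made-up rows in the shape of the summarized frame.
reduced <- data.frame(
  Job      = c("1907-", "1907-", "1908-"),
  month    = c("2019-08-01", "2019-08-01", "2019-08-01"),
  Employee = c(705, 1224, 705),
  freq     = c(1, 1, 1),
  Hours    = c(396, 229.5, 396),
  stringsAsFactors = FALSE
)

# Rows 1 and 3 share Employee 705 in the same month, on different jobs.
clashes <- shared_keys(reduced, c("Employee", "month"))
print(clashes)
```

An empty result would mean each employee appears at most once per month, which is the property the alluvium/x aesthetics in the plot rely on.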

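Any advice on avoiding this error would be very helpful. Searching for it has been frustrating, because the vast majority of results are about spread() and apparently unrelated. It may be one of my dplyr steps, or the ggalluvial call may be using spread() a few layers down, but at a level I can't troubleshoot.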

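Is there a good way to avoid the error up front? Why are my rows 109 and 110 still being flagged as duplicates when they are clearly not duplicates of each other? This will eventually go into a Shiny app, so my solution needs to stay stable across user-entered date ranges.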

1 Answer:

Answer 0: (score: 0)

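I'm not happy with this solution, but at least it works. The two variables I wanted to group on weren't working with group_by(Employee, month) %>%, so I added a mutate that creates a combined variable, grouped on that alone, and then used distinct to make sure the combined variable has no duplicates.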

    df_data_Aluv <- data %>%
        filter(PostDate >= "2019-08-01" & PostDate <= "2019-10-30") %>%
        select(date = PostDate, Employee, Job, Hours) %>%
        group_by(Job, month = as.character(floor_date(date, "month")), Employee) %>%
        summarize(freq = n_distinct(Employee), Hours = sum(Hours)) %>%
        mutate(empmon = paste(Employee, " -- ", month)) %>%
        group_by(empmon) %>%
        filter(Hours == max(Hours)) %>%
        distinct(empmon, .keep_all = TRUE)  

It works with any date range, so at least I got what I wanted.
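
A possibly tidier variant of the same idea, sketched under the assumption that dplyr 1.0.0 or later is available: `slice_max()` with `with_ties = FALSE` keeps exactly one row per group even when the maximum is tied, so the combined key and the `distinct()` call are not needed. The `summed` data below is made up for illustration, and this has not been run against the original 300k-row dataset.

```r
library(dplyr)

# Hypothetical summarized data: employee 1450 ties on Hours across two jobs.
summed <- data.frame(
  Job      = c("1901", "1902", "1901"),
  month    = c("2019-08-01", "2019-08-01", "2019-08-01"),
  Employee = c(1450, 1450, 1460),
  freq     = c(1, 1, 1),
  Hours    = c(16, 16, 9),
  stringsAsFactors = FALSE
)

one_per_month <- summed %>%
  group_by(Employee, month) %>%
  slice_max(Hours, n = 1, with_ties = FALSE) %>%  # exactly one row per group
  ungroup()

print(one_per_month)
```

Which of the tied jobs is kept is arbitrary here, just as it is with `distinct(empmon, .keep_all = TRUE)`, so a deliberate tie-breaker may be preferable if the choice of job matters.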
