Transforming a data frame to produce a waterfall chart in ggplot2

Time: 2017-03-27 15:37:48

Tags: r ggplot2 dplyr waterfall

I want to transform my data frame into a format suitable for a waterfall chart.

My data frame is as follows:

employee <- c('A','B','C','D','E','F', 
              'A','B','C','D','E','F',
              'A','B','C','D','E','F',
              'A','B','C','D','E','F')
revenue <- c(10, 20, 30, 40, 10, 40, 
              8, 10, 20, 50, 20, 10,
              2,  5, 70, 30, 10, 50,
             40,  8, 30, 40, 10, 40)
date <- as.Date(c('2017-03-01','2017-03-01','2017-03-01',
                  '2017-03-01','2017-03-01','2017-03-01',
                  '2017-03-02','2017-03-02','2017-03-02',
                  '2017-03-02','2017-03-02','2017-03-02',
                  '2017-03-03','2017-03-03','2017-03-03',
                  '2017-03-03','2017-03-03','2017-03-03',
                  '2017-03-04','2017-03-04','2017-03-04',
                  '2017-03-04','2017-03-04','2017-03-04'))
df<-data.frame(date,employee,revenue)

         date employee revenue
1  2017-03-01        A      10
2  2017-03-01        B      20
3  2017-03-01        C      30
4  2017-03-01        D      40
5  2017-03-01        E      10
6  2017-03-01        F      40
7  2017-03-02        A       8
8  2017-03-02        B      10
9  2017-03-02        C      20
10 2017-03-02        D      50
11 2017-03-02        E      20
12 2017-03-02        F      10
13 2017-03-03        A       2
14 2017-03-03        B       5
15 2017-03-03        C      70
16 2017-03-03        D      30
17 2017-03-03        E      10
18 2017-03-03        F      50
19 2017-03-04        A      40
20 2017-03-04        B       8
21 2017-03-04        C      30
22 2017-03-04        D      40
23 2017-03-04        E      10
24 2017-03-04        F      40

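How can I transform this data frame into the form needed for a waterfall chart in ggplot2?

The amount column is the difference between the employee's value for that day and their value for the previous day.

The end column is the start column plus the amount column.

The start column is the previous day's Total end value.

The final data frame should look like this: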

         date employee     start    end    amount    total_for_day
1  2017-03-01        A         0     10        10               10
2  2017-03-01        B         0     20        20               20
3  2017-03-01        C         0     30        30               30
4  2017-03-01        D         0     40        40               40
5  2017-03-01        E         0     10        10               10
6  2017-03-01        F         0     40        40               40
7  2017-03-01    Total         0    150       150              150
8  2017-03-02        A       150    148        -2                8
9  2017-03-02        B       150    140       -10               10
10 2017-03-02        C       150    140       -10               20
11 2017-03-02        D       150    160        10               50 
12 2017-03-02        E       150    160        10               20
13 2017-03-02        F       150    120       -30               10  
14 2017-03-02    Total       150    118       -32              118
15 2017-03-03        A       118    112        -6                2                      
16 2017-03-03        B       118    113        -5                5                  
17 2017-03-03        C       118    168        50               70
18 2017-03-03        D       118     98       -20               30  
19 2017-03-03        E       118    108       -10               10  
20 2017-03-03        F       118    158        40               50
21 2017-03-03    Total       118    167        49              167
22 2017-03-04        A       167    205        38               40
23 2017-03-04        B       167    170         3                8
24 2017-03-04        C       167    127       -40               30
25 2017-03-04        D       167    177        10               40
26 2017-03-04        E       167    167         0               10
27 2017-03-04        F       167    157       -10               40 
28 2017-03-04    Total       167    168         1              168

1 Answer:

Answer 0: (score: 3)

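There are a few steps that will get you there, and I think the dplyr package helps (it is used heavily below).

My understanding is that revenue gives the cumulative total revenue, not the daily change. If that is wrong, you will need to undo some of these calculations.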
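As an aside, here is a minimal sketch of the other interpretation, in which revenue holds each day's change rather than a running level. The toy data frame `daily` and the name `levels_df` are made up for illustration and are not part of the question; the idea is to take a per-employee `cumsum()` to recover the level before computing the previous day's value.

```r
library(dplyr)

# Hypothetical toy data: here revenue is a per-day change, not a running level
daily <- data.frame(
  date = rep(as.Date("2017-03-01") + 0:2, each = 2),
  employee = rep(c("A", "B"), times = 3),
  revenue = c(10, 20, -2, -10, -6, -5),
  stringsAsFactors = FALSE
)

levels_df <- daily %>%
  group_by(employee) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    change = revenue                           # the value already is the daily change
    , revenue = cumsum(change)                 # running level for each employee
    , previousDay = lag(revenue, default = 0)  # level at the end of the previous day
  ) %>%
  ungroup()
```

The resulting columns have the same names as the ones used in the rest of this answer, so the plotting code should apply unchanged under that assumption.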

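The first step is to create a new data.frame that calculates the daily totals and bind it back to the original data.frame. You can then group_by employee (including "Total") and add columns that are computed separately for each employee (the previous day's value, the change, and whether it was an increase or a decrease).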

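library(dplyr)
library(ggplot2)

toPlot <-
  bind_rows(
    df
    # Append one "Total" row per date holding that day's summed revenue
    , df %>%
      group_by(date) %>%
      summarise(revenue = sum(revenue)) %>%
      mutate(employee = "Total")
  ) %>%
  # Compute the day-over-day change within each employee (and within "Total")
  group_by(employee) %>%
  mutate(
    previousDay = lag(revenue, default = 0)
    , change = revenue - previousDay
    , direction = ifelse(change > 0
                         , "Positive"
                         , "Negative"))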

This returns:

         date employee revenue previousDay change direction
       <date>    <chr>   <dbl>       <dbl>  <dbl>     <chr>
1  2017-03-01        A      10           0     10  Positive
2  2017-03-01        B      20           0     20  Positive
3  2017-03-01        C      30           0     30  Positive
4  2017-03-01        D      40           0     40  Positive
5  2017-03-01        E      10           0     10  Positive
6  2017-03-01        F      40           0     40  Positive
7  2017-03-02        A       8          10     -2  Negative
8  2017-03-02        B      10          20    -10  Negative
9  2017-03-02        C      20          30    -10  Negative
10 2017-03-02        D      50          40     10  Positive
# ... with 18 more rows

We can then plot it using:

toPlot %>%
  ggplot(aes(xmin = date - 0.5
             , xmax = date + 0.5
             , ymin = previousDay
             , ymax = revenue
             , fill = direction)) +
  geom_rect(col = "black"
            , show.legend = FALSE) +
  facet_wrap(~employee
             , scales = "free_y") +
  scale_fill_brewer(palette = "Set1")

which gives:

[Figure: waterfall plot faceted by employee, including Total, with free y scales]

Note that including "Total" throws off the scale (which is why free scales are needed), so I would rather leave it out:

toPlot %>%
  filter(employee != "Total") %>%
  ggplot(aes(xmin = date - 0.5
             , xmax = date + 0.5
             , ymin = previousDay
             , ymax = revenue
             , fill = direction)) +
  geom_rect(col = "black"
            , show.legend = FALSE) +
  facet_wrap(~employee) +
  scale_fill_brewer(palette = "Set1")

This allows direct comparison between employees:

[Figure: waterfall plot faceted by employee, without Total, on a shared y scale]

And here is the total:

toPlot %>%
  filter(employee == "Total") %>%
  ggplot(aes(xmin = date - 0.5
             , xmax = date + 0.5
             , ymin = previousDay
             , ymax = revenue
             , fill = direction)) +
  geom_rect(col = "black"
            , show.legend = FALSE) +
  scale_fill_brewer(palette = "Set1")

[Figure: waterfall plot of the daily Total]

That said, I still find a line plot easier to understand (especially for comparing employees):

toPlot %>%
  filter(employee != "Total") %>%
  ggplot(aes(x = date
             , y = revenue
             , col = employee)) +
  geom_line() +
  scale_color_brewer(palette = "Dark2")

[Figure: line plot of revenue by date, one line per employee]

If you want to plot the change by day, you can do the following:

toPlot %>%
  filter(employee != "Total") %>%
  ggplot(aes(x = date
             , y = change
             , fill = employee)) +
  geom_col(position = "dodge") +
  scale_fill_brewer(palette = "Dark2")

which gives:

[Figure: dodged bar plot of daily change, one bar per employee per date]

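But now you are moving away from the "waterfall" plot output. If you really, really want a waterfall plot that can be compared across employees, you can make one, but it gets rather ugly (I would strongly recommend the line plot above instead).

Here, you need to position the boxes manually, and it will take some tweaking if you change the output aspect ratio (or size) or the number of employees. You also need to include colors for both the employee and the direction of change, which starts to get rough. This falls under "you can, but probably shouldn't": there is likely a better way to display these data.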

toPlot %>%
  filter(employee != "Total") %>%
  ungroup() %>%
  mutate(empNumber = as.numeric(as.factor(employee))) %>%
  ggplot(aes(xmin = (empNumber) - 0.4
             , xmax = (empNumber) + 0.4
             , ymin = previousDay
             , ymax = revenue
             , col = direction
             , fill = employee)) +
  geom_rect(size = 1.5) +
  facet_grid(~date) +
  scale_fill_brewer(palette = "Dark2") +
  theme(axis.text.x = element_blank()
        , axis.ticks.x = element_blank())

which gives:

[Figure: waterfall plot with one panel per date and one box per employee]
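The plots above work from previousDay and change rather than from the exact start / end / amount / total_for_day layout shown in the question. If that layout is needed, the following is one possible sketch, not a definitive implementation. It assumes that start is the previous day's Total (0 before the first day), that amount is each row's change from its own previous day, and that end = start + amount, which is what the example rows in the question imply. The names `totals` and `waterfall` are introduced here purely for illustration.

```r
library(dplyr)

# Rebuild the question's data so the sketch is self-contained
df <- data.frame(
  date = rep(as.Date("2017-03-01") + 0:3, each = 6),
  employee = rep(LETTERS[1:6], times = 4),
  revenue = c(10, 20, 30, 40, 10, 40,
               8, 10, 20, 50, 20, 10,
               2,  5, 70, 30, 10, 50,
              40,  8, 30, 40, 10, 40),
  stringsAsFactors = FALSE
)

# Daily totals, plus the previous day's total (assumed to be 0 before day one)
totals <- df %>%
  group_by(date) %>%
  summarise(revenue = sum(revenue)) %>%
  arrange(date) %>%
  mutate(employee = "Total"
         , start = lag(revenue, default = 0))

waterfall <- bind_rows(df, select(totals, date, employee, revenue)) %>%
  group_by(employee) %>%
  arrange(date, .by_group = TRUE) %>%
  # amount: change from the same employee's (or Total's) previous day
  mutate(amount = revenue - lag(revenue, default = 0)) %>%
  ungroup() %>%
  # every row on a given date starts from the previous day's Total
  left_join(select(totals, date, start), by = "date") %>%
  mutate(end = start + amount) %>%
  arrange(date, employee) %>%
  select(date, employee, start, end, amount, total_for_day = revenue)
```

With the question's data this yields 28 rows (six employees plus one Total row per date), and for the Total rows end equals total_for_day. The start and end columns can then be mapped to ymin and ymax in geom_rect in the same way as previousDay and revenue above.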