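Sankey diagram with multiple columns and a weight column, using the networkD3 package

Time: 2018-09-18 08:07:56

Tags: r sankey-diagram htmlwidgets networkd3

I am trying to make an interactive Sankey diagram with the networkD3 package. I have a dataset with eight columns.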

df <- read.csv(header = TRUE, as.is = TRUE, text = '
clientcode,year1,year2,year3,year4,year5,year6,year7
1,DBC,DBBC,DBBC,DBC,DBC,"Not in care","Not in care"
2,DBC,DBBC,DBBC,"Not in care","Not in care","Not in care","Not in care"
3,DBC,DBBC,"Not in care","Not in care","Not in care","Not in care","Not in care"
4,DBC,DBBC,"Not in care","Not in care","Not in care","Not in care","Not in care"
5,DBC,DBBC,DBBC,"Not in care","Not in care","Not in care","Not in care"
')

I am using the code below, taken from the answer that begins "This question comes up a lot...": https://stackoverflow.com/a/52237151/4389763

Here is my code:

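library(dplyr)
library(tidyr)

# keep only the year columns; clientcode carries no link information
df <- df %>% select(year1, year2, year3, year4, year5, year6, year7)

# one row per consecutive pair of columns (source -> target) within each client
links <-
  df %>%
  mutate(row = row_number()) %>%
  gather('column', 'source', -row) %>%
  mutate(column = match(column, names(df))) %>%
  group_by(row) %>%
  arrange(column) %>%
  mutate(target = lead(source)) %>%
  ungroup() %>%
  filter(!is.na(target))

# suffix each label with its column index so the same state in different years is a distinct node
links <-
  links %>%
  mutate(source = paste0(source, '_', column)) %>%
  mutate(target = paste0(target, '_', column + 1)) %>%
  select(source, target)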

nodes <- data.frame(name = unique(c(links$source, links$target)))

links$source <- match(links$source, nodes$name) - 1
links$target <- match(links$target, nodes$name) - 1
links$value <- 1

nodes$name <- sub('_[0-9]+$', '', nodes$name)

library(networkD3)
library(htmlwidgets)

sankeyNetwork(Links = links, Nodes = nodes, Source = 'source',
          Target = 'target', Value = 'value', NodeID = 'name')

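But I don't know how to add up the values of the flows. For example, DBC to DBBC occurs five times from year1 to year2, and DBBC to DBBC occurs three times from year2 to year3. With the code above, every occurrence is drawn as a separate link with a value of 1, whereas I would like to see the total value of each flow.

Like this example of a Sankey diagram. There you can see the total for, e.g., group_A to group_C, rather than one link per occurrence.

Is it possible to see the percentage on mouseover? For example, Year1 = DBC to Year2 = DBBC has a value of 5 out of 5, so the percentage is 100%.

Can someone help me? Thanks.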

2 Answers:

Answer 0: (score: 0)

I changed the code.

Instead of:

links$value <- 1

New code:

links <- links %>% group_by(source, target) %>% tally()
names(links)[3] <- "value"
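The same aggregation can also be written in one step. The following is only a sketch, assuming a dplyr version whose count() accepts the name argument; links_idx and links_sum are hypothetical names standing in for the index-based links produced by the match() step in the question.

```r
library(dplyr)

# Hypothetical index-based links, shaped like `links` after the match() step
links_idx <- data.frame(
  source = c(0, 0, 0, 1, 1),
  target = c(1, 1, 1, 2, 3)
)

links_sum <-
  links_idx %>%
  count(source, target, name = "value") %>%  # one row per source/target pair
  as.data.frame()                            # plain data frame with no grouping attached

links_sum
```

Here count() does the grouping, counting, and naming of the value column together, so the separate names(links)[3] assignment is not needed.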

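Answer 1: (score: 0)

The first part of your question, how to get a dataset of links (source and target columns) from a dataset in which each row defines multiple links/edges across several columns, is answered by the answer that you linked to. The one small addition is that your data starts with an extra column, clientcode, which contains no link information, so it needs to be removed first.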

df <- read.csv(header = TRUE, as.is = TRUE, text = '
clientcode,year1,year2,year3,year4,year5,year6,year7
1,DBC,DBBC,DBBC,DBC,DBC,"Not in care","Not in care"
2,DBC,DBBC,DBBC,"Not in care","Not in care","Not in care","Not in care"
3,DBC,DBBC,"Not in care","Not in care","Not in care","Not in care","Not in care"
4,DBC,DBBC,"Not in care","Not in care","Not in care","Not in care","Not in care"
5,DBC,DBBC,DBBC,"Not in care","Not in care","Not in care","Not in care"
')

library(dplyr)
library(tidyr)

links <-
  df %>%
  select(-clientcode) %>% 
  mutate(row = row_number()) %>%
  gather('column', 'source', -row) %>%
  mutate(column = match(column, names(df))) %>%
  group_by(row) %>%
  arrange(column) %>%
  mutate(target = lead(source)) %>%
  ungroup() %>%
  filter(!is.na(target)) %>%
  mutate(source = paste0(source, '_', column)) %>%
  mutate(target = paste0(target, '_', column + 1)) %>%
  select(source, target)

links

# # A tibble: 30 x 2
#    source target       
#    <chr>  <chr>        
#  1 DBC_2  DBBC_3       
#  2 DBC_2  DBBC_3       
#  3 DBC_2  DBBC_3       
#  4 DBC_2  DBBC_3       
#  5 DBC_2  DBBC_3       
#  6 DBBC_3 DBBC_4       
#  7 DBBC_3 DBBC_4       
#  8 DBBC_3 Not in care_4
#  9 DBBC_3 Not in care_4
# 10 DBBC_3 DBBC_4       
# # ... with 20 more rows

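The second part of your question is essentially: given a dataset of individual links, how do I aggregate identical links into a single link, with a value column that records how many individual links were aggregated into it? This can be done by grouping on the source and target columns and summarising with a count of the rows.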

links %>% 
  group_by(source, target) %>% 
  summarise(value = n())

# # A tibble: 11 x 3
# # Groups:   source [?]
#    source        target        value
#    <chr>         <chr>         <int>
#  1 DBBC_3        DBBC_4            3
#  2 DBBC_3        Not in care_4     2
#  3 DBBC_4        DBC_5             1
#  4 DBBC_4        Not in care_5     2
#  5 DBC_2         DBBC_3            5
#  6 DBC_5         DBC_6             1
#  7 DBC_6         Not in care_7     1
#  8 Not in care_4 Not in care_5     2
#  9 Not in care_5 Not in care_6     4
# 10 Not in care_6 Not in care_7     4
# 11 Not in care_7 Not in care_8     5
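To draw the aggregated links, the string labels still have to be turned into a nodes data frame and zero-based indices, as in the question's code. The following is a minimal sketch, assuming a few rows copied by hand from the summarised output above; links_agg and p are names used only for illustration.

```r
library(networkD3)

# A few aggregated links, copied from the summarised output above
links_agg <- data.frame(
  source = c("DBC_2", "DBBC_3", "DBBC_3"),
  target = c("DBBC_3", "DBBC_4", "Not in care_4"),
  value  = c(5, 3, 2),
  stringsAsFactors = FALSE
)

# One node per unique source/target label
nodes <- data.frame(name = unique(c(links_agg$source, links_agg$target)),
                    stringsAsFactors = FALSE)

# networkD3 expects zero-based row indices into the nodes data frame
links_agg$source <- match(links_agg$source, nodes$name) - 1
links_agg$target <- match(links_agg$target, nodes$name) - 1

# Strip the "_<column>" suffix so the labels read "DBC", "DBBC", ...
nodes$name <- sub('_[0-9]+$', '', nodes$name)

p <- sankeyNetwork(Links = links_agg, Nodes = nodes, Source = 'source',
                   Target = 'target', Value = 'value', NodeID = 'name')
p
```

The aggregation has to happen before the suffix is stripped from the node names; otherwise the same state in different years would collapse into one label before the indices are assigned.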

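Since you want to show percentages rather than counts, you could modify this slightly to calculate each link's percentage of all links for its year, and then use the units = "%" argument of sankeyNetwork so that the values are displayed properly.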
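One way the percentage step might look is sketched below. It assumes the aggregated links still carry their "_<column>" suffix, and it treats that suffix on the source label as the year step; links_agg, links_pct, and step are hypothetical names introduced for this example.

```r
library(dplyr)

# A few aggregated links with string labels, as in the summarised output above
links_agg <- data.frame(
  source = c("DBC_2", "DBBC_3", "DBBC_3"),
  target = c("DBBC_3", "DBBC_4", "Not in care_4"),
  value  = c(5, 3, 2),
  stringsAsFactors = FALSE
)

links_pct <-
  links_agg %>%
  mutate(step = sub('^.*_', '', source)) %>%   # year step taken from the "_<column>" suffix
  group_by(step) %>%
  mutate(value = 100 * value / sum(value)) %>% # share of all links leaving that year
  ungroup() %>%
  select(-step)

links_pct
```

With these rows, the DBC to DBBC link becomes 100 (5 of 5), and the two links leaving DBBC become 60 and 40. After converting source and target to zero-based indices in the usual way, links_pct can be passed to sankeyNetwork together with units = "%".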