From '(a-b-c)' to '(a-b) and (b-c)' in a data.frame

Time: 2016-07-19 05:16:37

Tags: r dataframe sequence visualization

I have a data.frame of this form:

  sequence support
1      a-b     0.6
2      b-c     0.6
3      a-c     0.6
4    a-b-c     1.0
5      a-d     0.6

I can convert it into the following:

  1    2    3 support
1 a    b <NA>     0.6
2 b    c <NA>     0.6
3 a    c <NA>     1.0
4 a    b    c     0.6
5 a    d <NA>     1.0

I need to convert the table above into this:

  1    2  support
1 a    b      0.6
2 b    c      0.6
3 a    d      1.0

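More specifically, I want to draw a Sankey diagram.

So I have to convert the first data.frame into 'start node' and 'end node' form.

For example, to draw the sequences 'a-b-c' and 'a-d', I need the following data.frame: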

start end
    a   b
    b   c
    a   d

How can I do this?

2 Answers:

Answer 0: (score: 3)

Use strsplit and apply:

# data
df1 <- read.table(text = "sequence support
1      a-b     0.6
2      b-c     0.6
3      a-c     0.6
4    a-b-c     1.0
5      a-d     0.6", header = TRUE, as.is = TRUE)

# result - input for sankey
datSankey <-
  do.call(rbind,
          apply(df1, 1, function(i){
            x <- unlist(strsplit(i[1], "-"))
            cbind.data.frame(
              From = x[-length(x)],  # every node except the last
              To = x[-1],            # every node except the first
              Weight = as.numeric(i[2]),
              deparse.level = 0)
          })
  )

#             From To Weight
# 1              a  b    0.6
# 2              b  c    0.6
# 3              a  c    0.6
# 4.sequence1    a  b    1.0
# 4.sequence2    b  c    1.0
# 5              a  d    0.6

# plot
library(googleVis)
plot(gvisSankey(datSankey,
                from = "From", to = "To", weight = "Weight"))
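
The same expansion can also be sketched in base R without apply(), which avoids the round trip through a character matrix. This is only an illustration: the helper name sequence_to_edges and the column names start, end and support are assumptions chosen here to match the question's wording, not part of the answer above.

```r
# Hypothetical helper: expand "a-b-c" style sequences into start/end pairs.
sequence_to_edges <- function(sequence, support, sep = "-") {
  nodes <- strsplit(sequence, sep, fixed = TRUE)
  # a sequence with n nodes yields n - 1 edges (none if it has a single node)
  n_edges <- pmax(lengths(nodes) - 1L, 0L)
  data.frame(
    start   = as.character(unlist(lapply(nodes, function(x) head(x, -1)))),
    end     = as.character(unlist(lapply(nodes, function(x) tail(x, -1)))),
    support = rep(support, times = n_edges),
    stringsAsFactors = FALSE
  )
}

# the question's data, rebuilt so the sketch is self-contained
df1 <- data.frame(
  sequence = c("a-b", "b-c", "a-c", "a-b-c", "a-d"),
  support  = c(0.6, 0.6, 0.6, 1.0, 0.6),
  stringsAsFactors = FALSE
)

edges <- sequence_to_edges(df1$sequence, df1$support)
print(edges)
```

Each edge inherits the support of the sequence it came from, so a-b and b-c appear twice here (once from their own rows and once from a-b-c), just as in the output above.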


Answer 1: (score: 2)

We can try:

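library(splitstackshape)
# df is the question's data.frame (named df1 in the answer above)
df <- df1
# flag sequences with three or more nodes, e.g. "a-b-c"
i1 <- grepl("-[^-]+-", df$sequence)
# drop the first interior node: "a-b-c" becomes "a-c"
df$sequence[i1] <- sub("-[^-]+", "", df$sequence[i1])
# discard every sequence that now occurs more than once, then split on "-"
dup <- duplicated(df$sequence) | duplicated(df$sequence, fromLast = TRUE)
res <- cSplit(df[!dup, ], "sequence", "-")
res[, 2:3, with = FALSE]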
#   sequence_1 sequence_2
#1:          a          b
#2:          b          c
#3:          a          d
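
The regular expression above appears tailored to three-node sequences. For sequences of arbitrary length, one possible reading of the desired output is "keep only sequences that are not contained in a longer one, then split those into consecutive pairs". That rule is an assumption on my part (the question does not state it), and the names is_subsequence and is_maximal are made up for this base R sketch:

```r
# TRUE when the nodes of `short` appear in `long` in the same order
# (not necessarily adjacent), e.g. a-c inside a-b-c.
is_subsequence <- function(short, long) {
  pos <- 0L
  for (node in short) {
    hits <- which(long == node)
    hits <- hits[hits > pos]
    if (length(hits) == 0L) return(FALSE)
    pos <- hits[1L]
  }
  TRUE
}

sequences <- c("a-b", "b-c", "a-c", "a-b-c", "a-d")
nodes <- strsplit(sequences, "-", fixed = TRUE)

# assumed rule: a sequence is maximal when no longer sequence contains it
is_maximal <- vapply(seq_along(nodes), function(i) {
  !any(vapply(seq_along(nodes), function(j) {
    j != i && length(nodes[[j]]) > length(nodes[[i]]) &&
      is_subsequence(nodes[[i]], nodes[[j]])
  }, logical(1)))
}, logical(1))

# split the surviving sequences into consecutive start/end pairs
kept <- nodes[is_maximal]
edges <- unique(data.frame(
  start = unlist(lapply(kept, head, -1)),
  end   = unlist(lapply(kept, tail, -1)),
  stringsAsFactors = FALSE
))
print(edges)
```

On the question's data only a-b-c and a-d survive the filter, which gives the three start/end pairs a-b, b-c and a-d. The pairwise comparison is quadratic in the number of sequences, which is fine for small pattern tables but would need rethinking for large ones.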