I have a data.frame of this form:
sequence support
1 a-b 0.6
2 b-c 0.6
3 a-c 0.6
4 a-b-c 1.0
5 a-d 0.6
I can convert it into the following:
1 2 3 support
1 a b <NA> 0.6
2 b c <NA> 0.6
3 a c <NA> 1.0
4 a b c 0.6
5 a d <NA> 1.0
I need to convert the table above into this:
1 2 support
1 a b 0.6
2 b c 0.6
3 a d 1.0
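More specifically, I want to draw a Sankey diagram.
So I have to convert the first data.frame into 'start node' and 'end node' form.
For example, to draw the sequences 'a-b-c' and 'a-d', I need the following data.frame: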
start end
a b
b c
a d
How can I do this?
Answer 0 (score: 3)
Using strsplit and apply:
# data
df1 <- read.table(text = "sequence support
1 a-b 0.6
2 b-c 0.6
3 a-c 0.6
4 a-b-c 1.0
5 a-d 0.6", header = TRUE, as.is = TRUE)
# result - input for sankey
datSankey <-
  do.call(rbind,
          apply(df1, 1, function(i){
            # split one sequence into its nodes
            x <- unlist(strsplit(i[1], "-"))
            # pair each node with its successor
            cbind.data.frame(
              From = head(x, -1),
              To = tail(x, -1),
              Weight = as.numeric(i[2]),
              deparse.level = 0)
          })
  )
# From To Weight
# 1 a b 0.6
# 2 b c 0.6
# 3 a c 0.6
# 4.sequence1 a b 1.0
# 4.sequence2 b c 1.0
# 5 a d 0.6
# plot
library(googleVis)
plot(gvisSankey(datSankey,
from = "From", to = "To", weight = "Weight"))
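The result above contains the pairs a-b and b-c twice, once from the two-node sequences and once from 'a-b-c'. If repeated pairs are not wanted, a minimal base R sketch along the same lines could expand the sequences and then merge duplicates. The helper name `sequence_to_edges` is hypothetical, and summing the weights of repeated pairs is an assumption; another rule such as `max` may suit the data better.

```r
# Hypothetical helper: expand "a-b-c" style sequences into consecutive pairs.
sequence_to_edges <- function(sequence, support, sep = "-") {
  parts <- strsplit(as.character(sequence), sep, fixed = TRUE)
  edges <- Map(function(x, w) {
    if (length(x) < 2) return(NULL)  # a single node has no edge
    data.frame(From = head(x, -1), To = tail(x, -1), Weight = w,
               stringsAsFactors = FALSE)
  }, parts, support)
  # drop empty results before binding the per-sequence pieces together
  do.call(rbind, Filter(Negate(is.null), edges))
}

df1 <- data.frame(
  sequence = c("a-b", "b-c", "a-c", "a-b-c", "a-d"),
  support  = c(0.6, 0.6, 0.6, 1.0, 0.6),
  stringsAsFactors = FALSE
)

edges <- sequence_to_edges(df1$sequence, df1$support)

# Assumption: repeated From/To pairs are merged by summing their weights.
edges_sum <- aggregate(Weight ~ From + To, data = edges, FUN = sum)
```

`edges` has one row per consecutive pair, and `edges_sum` has one row per distinct pair, so either can be passed to gvisSankey in place of datSankey.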
Answer 1 (score: 2)
We can try
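library(splitstackshape)
# df is the data.frame from the question
# flag sequences with three or more nodes
i1 <- grepl("-[^-]+-", df$sequence)
# drop their first middle node, e.g. "a-b-c" becomes "a-c"
df$sequence[i1] <- sub("-[^-]+", "", df$sequence[i1])
# keep only sequences that now occur exactly once, then split on "-"
res <- cSplit(df[!(duplicated(df$sequence)|duplicated(df$sequence,
                                                     fromLast=TRUE)),], "sequence", "-")
res[, 2:3, with = FALSE]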
# sequence_1 sequence_2
#1: a b
#2: b c
#3: a d
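To make the logic of this answer easier to follow, here is a step-by-step sketch in base R only, with strsplit standing in for cSplit. The variable names (`has_middle`, `collapsed`, `kept`, `pairs`) are illustrative and not part of the original answer. Note that the sub() call removes only the first middle node, so the trick collapses a sequence to its endpoints only when it has at most three nodes.

```r
sequence <- c("a-b", "b-c", "a-c", "a-b-c", "a-d")

# Step 1: flag sequences with at least three nodes (two separators).
has_middle <- grepl("-[^-]+-", sequence)

# Step 2: drop the first middle node of those sequences: "a-b-c" becomes "a-c".
collapsed <- sequence
collapsed[has_middle] <- sub("-[^-]+", "", collapsed[has_middle])

# Step 3: discard every value that occurs more than once (both copies of "a-c").
dup <- duplicated(collapsed) | duplicated(collapsed, fromLast = TRUE)
kept <- collapsed[!dup]

# Step 4: split the survivors into start and end columns.
pairs <- do.call(rbind, strsplit(kept, "-", fixed = TRUE))
colnames(pairs) <- c("start", "end")
```

Here the direct a-c link is removed because it coincides with the endpoints of 'a-b-c', which leaves the pairs a-b, b-c and a-d.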