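I have a data.frame in which each row is a group of people, with between 2 and 8 people per group. From this I want to create a graph in which every person is a vertex and an edge means that the two people appear together in a row of my original data.frame. Drawing the graph is not the problem, only creating the edge list. My question seems very close to "Creating an edgelist from Patent data in R", but since this is my first day with R, I could not bridge the gap from there.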
My data looks like this:
name_1 name_2 name_3 name_4 name_5
jan tim
tom tim bernd
stefen tom tim jan bernd
marcel bernd
Output of dput(data):
structure(list(name_1 = structure(c(1L, 4L, 3L, 2L), .Label = c("jan",
"marcel", "stefen", "tom"), class = "factor"), name_2 = structure(c(2L,
2L, 3L, 1L), .Label = c("bernd", "tim", "tom"), class = "factor"),
name_3 = structure(c(1L, 2L, 3L, 1L), .Label = c("", "bernd",
"tim"), class = "factor"), name_4 = structure(c(1L, 1L, 2L,
1L), .Label = c("", "jan"), class = "factor"), name_5 = structure(c(1L,
1L, 2L, 1L), .Label = c("", "bernd"), class = "factor")), .Names = c("name_1",
"name_2", "name_3", "name_4", "name_5"), class = "data.frame", row.names = c(NA,
-4L))
As the desired output I would like something like this:
jan tim
tom tim
tom bernd
tim bernd
stefen tom
stefen tim
stefen jan
stefen bernd
tom tim
tom jan
tom bernd
tim jan
tim bernd
jan bernd
marcel bernd
Duplicates are fine with me.
Answer 0 (score: 1)
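Here are a few options. First, in base R, you can use apply to iterate over the rows and combn to get the combinations. Because your data is in an awkward layout, you need a fair amount of glue code to stitch it together, e.g.,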
df <- data.frame(name_1 = c("jan", "tom", "stefen", "marcel"),
                 name_2 = c("tim", "tim", "tom", "bernd"),
                 name_3 = c("", "bernd", "tim", ""),
                 name_4 = c("", "", "jan", ""),
                 name_5 = c("", "", "bernd", ""))
# apply() returns a list here because the rows yield different numbers of pairs
as.data.frame(
  do.call(rbind,
          apply(df, 1, function(x){
            x <- x[x != '']   # drop the empty cells in this row
            t(combn(x, 2))    # one row per pair of names
          })),
  stringsAsFactors = FALSE)
#> V1 V2
#> 1 jan tim
#> 2 tom tim
#> 3 tom bernd
#> 4 tim bernd
#> 5 stefen tom
#> 6 stefen tim
#> 7 stefen jan
#> 8 stefen bernd
#> 9 tom tim
#> 10 tom jan
#> 11 tom bernd
#> 12 tim jan
#> 13 tim bernd
#> 14 jan bernd
#> 15 marcel bernd
Technically, it would be better to use Map rather than apply (which coerces to a matrix), but that takes even more glue code:
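as.data.frame(do.call(rbind,
  do.call(Map,
          c(function(...){
              x <- c(...)       # the names of one row, one value per column
              x <- x[x != '']   # drop the empty cells
              t(combn(x, 2))    # one row per pair of names
            },
            lapply(df, as.character)))),   # convert factor columns to character
  stringsAsFactors = FALSE)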
#> V1 V2
#> 1 jan tim
#> 2 tom tim
#> 3 tom bernd
#> 4 tim bernd
#> 5 stefen tom
#> 6 stefen tim
#> 7 stefen jan
#> 8 stefen bernd
#> 9 tom tim
#> 10 tom jan
#> 11 tom bernd
#> 12 tim jan
#> 13 tim bernd
#> 14 jan bernd
#> 15 marcel bernd
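The remark above about apply coercing to a matrix is easiest to see on a data frame with mixed column types. The following is only a small illustration; the data frame d and the names via_apply and via_map are made up for this sketch and are not part of the question's data.

```r
# Small mixed-type data frame (made up for illustration)
d <- data.frame(id = 1:2, name = c("a", "b"), stringsAsFactors = FALSE)

# apply() calls as.matrix() first, so every cell reaches the function as character
via_apply <- apply(d, 1, function(x) class(x))

# Map() walks the columns in parallel, so each value keeps its column's type
via_map <- Map(function(id, name) class(id), d$id, d$name)

unname(via_apply)   # "character" "character"
unlist(via_map)     # "integer" "integer"
```

For the name data in the question every column is text anyway, so the coercion is harmless there; it matters once numeric or date columns sit next to the names.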
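Both of these approaches switch back and forth between matrices and data frames, which is hard to keep track of. You could write an approach that avoids data frames entirely, but it would be longer.
A prettier option is to use the tidyverse. First, tidy the data into long form: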
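As a rough idea of what such a longer version could look like, here is a minimal sketch that pulls each row out as a plain character vector and works with lists and a matrix from there on. The helper name row_pairs and the column names from/to are assumptions made for this sketch, and stringsAsFactors = FALSE is set so the names are character from the start.

```r
# Same example data as above, kept as character rather than factor
df <- data.frame(name_1 = c("jan", "tom", "stefen", "marcel"),
                 name_2 = c("tim", "tim", "tom", "bernd"),
                 name_3 = c("", "bernd", "tim", ""),
                 name_4 = c("", "", "jan", ""),
                 name_5 = c("", "", "bernd", ""),
                 stringsAsFactors = FALSE)

# Hypothetical helper: all pairs of the non-empty names in one row
row_pairs <- function(x) {
  x <- x[!is.na(x) & x != ""]
  # combn() fails for fewer than 2 names, so return an empty 2-column matrix
  if (length(x) < 2) return(matrix(character(0), ncol = 2))
  t(combn(x, 2))
}

# One plain character vector per row
rows <- lapply(seq_len(nrow(df)),
               function(i) unlist(df[i, ], use.names = FALSE))

# A list of 2-column matrices, stacked into one edge matrix
edges <- do.call(rbind, lapply(rows, row_pairs))
colnames(edges) <- c("from", "to")

nrow(edges)   # 15
```

Using lapply instead of apply also sidesteps the case where every row has the same number of names, in which apply would simplify its result to a matrix instead of returning a list.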
library(tidyverse)
df_tidy <- df %>%
  mutate_all(as.character) %>%
  mutate_all(na_if, '') %>%
  rowid_to_column() %>%
  gather(col, name, -rowid) %>%
  drop_na(name)
df_tidy
#> rowid col name
#> 1 1 name_1 jan
#> 2 2 name_1 tom
#> 3 3 name_1 stefen
#> 4 4 name_1 marcel
#> 5 1 name_2 tim
#> 6 2 name_2 tim
#> 7 3 name_2 tom
#> 8 4 name_2 bernd
#> 10 2 name_3 bernd
#> 11 3 name_3 tim
#> 15 3 name_4 jan
#> 19 3 name_5 bernd
Much better. From this point, even the base approach is easier, e.g.
as.data.frame(do.call(rbind,
  aggregate(name ~ rowid, df_tidy,
            function(x){list(t(combn(x, 2)))})$name))
Or, sticking with the tidyverse, use combn:
df_tidy %>%
  group_by(rowid) %>%
  summarise(name = list(combn(name, 2, compose(as_data_frame, t), simplify = FALSE)),
            name = map(name, bind_rows)) %>%
  unnest(name)
#> # A tibble: 15 x 3
#> rowid V1 V2
#> <int> <chr> <chr>
#> 1 1 jan tim
#> 2 2 tom tim
#> 3 2 tom bernd
#> 4 2 tim bernd
#> 5 3 stefen tom
#> 6 3 stefen tim
#> 7 3 stefen jan
#> 8 3 stefen bernd
#> 9 3 tom tim
#> 10 3 tom jan
#> 11 3 tom bernd
#> 12 3 tim jan
#> 13 3 tim bernd
#> 14 3 jan bernd
#> 15 4 marcel bernd
...or a less efficient but less complicated approach:
df_tidy %>%
  group_by(rowid) %>%
  mutate(name2 = list(name)) %>%
  unnest() %>%
  filter(name < name2)
#> # A tibble: 15 x 4
#> # Groups: rowid [4]
#> rowid col name name2
#> <int> <chr> <chr> <chr>
#> 1 1 name_1 jan tim
#> 2 3 name_1 stefen tom
#> 3 3 name_1 stefen tim
#> 4 2 name_2 tim tom
#> 5 4 name_2 bernd marcel
#> 6 2 name_3 bernd tom
#> 7 2 name_3 bernd tim
#> 8 3 name_3 tim tom
#> 9 3 name_4 jan stefen
#> 10 3 name_4 jan tom
#> 11 3 name_4 jan tim
#> 12 3 name_5 bernd stefen
#> 13 3 name_5 bernd tom
#> 14 3 name_5 bernd tim
#> 15 3 name_5 bernd jan
tidyr::complete can be used in a similar way.
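The same "pair everything within a row, then keep one ordering" idea can also be sketched in base R with merge, which may help if the tidyverse is not available. This is only a sketch: df_long is a hand-typed stand-in for the rowid and name columns of the long-format data above, and the suffixes _1 and _2 are arbitrary choices.

```r
# Hand-typed stand-in for the long-format data (rowid and name columns only)
df_long <- data.frame(
  rowid = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4),
  name  = c("jan", "tim",
            "tom", "tim", "bernd",
            "stefen", "tom", "tim", "jan", "bernd",
            "marcel", "bernd"),
  stringsAsFactors = FALSE
)

# Self-join on rowid: every name is paired with every name of the same row,
# including itself and both orderings of each pair
pairs <- merge(df_long, df_long, by = "rowid", suffixes = c("_1", "_2"))

# Keeping only name_1 < name_2 drops the self-pairs and one of the two orderings
edges <- pairs[pairs$name_1 < pairs$name_2, c("name_1", "name_2")]

nrow(edges)   # 15
```

As with the filter(name < name2) version, this builds the full cross product within each row before discarding more than half of it, so it trades efficiency for simplicity.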
Answer 1 (score: 1)
Here we use the combinations function from the gtools package.
df1 is the given data frame.
library(gtools)
# convert factor columns to character (sapply returns a character matrix)
df1 <- sapply(df1, as.character)
# get the non-blank names of each row, joined as "a, b, c"
df1 <- apply(df1, 1, function(x) toString(x[x != '']))
# build the combinations of names for each row
answer <- lapply(seq_along(df1), function(i) {
  # toString() separates with ", ", so split on that to avoid leading spaces
  val <- unlist(strsplit(df1[i], split = ', ', fixed = TRUE))
  combinations(n = length(val), r = 2, v = val, repeats.allowed = FALSE)
})
# bind the list of matrices into one matrix
df2 <- do.call('rbind', answer)
print(df2)
[1,] "jan" "tim"
[2,] "bernd" "tim"
[3,] "bernd" "tom"
[4,] "tim" "tom"
[5,] "bernd" "jan"
[6,] "bernd" "stefen"
[7,] "bernd" "tim"
[8,] "bernd" "tom"
[9,] "jan" "stefen"
[10,] "jan" "tim"
[11,] "jan" "tom"
[12,] "stefen" "tim"
[13,] "stefen" "tom"
[14,] "tim" "tom"
[15,] "bernd" "marcel"