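I have a huge data frame in the input format below. I am trying to split each column on the delimiter ":" and output the value from column 1 together with the column number and the individual row values.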
input <- structure(list(V1 = structure(1:2, .Label = c("a1", "a2"), class = "factor"),
V2 = structure(1:2, .Label = c("aaa-1-c:bbb-1-d:ccc:a", "www-1-c"
), class = "factor"), V3 = structure(1:2, .Label = c("cc:nnn:ttt-cc",
"cdd:aaa:pp"), class = "factor"), V4 = structure(c(1L, NA
), .Label = "aaa-1-d", class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
I tried this, but the column numbers and the values are not in the right order.
output <- structure(list(V1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L), .Label = c("a1", "a2 "), class = "factor"),
V2 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 1L, 1L, 1L, 1L), V3 = structure(c(3L,
5L, 7L, 1L, 6L, 9L, 11L, 4L, 12L, 8L, 2L, 10L), .Label = c("a",
"aaa", "aaa-1-c", "aaa-1-d", "bbb-1-d", "cc", "ccc", "cdd",
"nnn", "pp", "ttt-cc", "www-1-c"), class = "factor")), class = "data.frame", row.names = c(NA,
-12L))
Can anyone help? Thanks!
Answer 0: (score: 2)
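Here is one option: reshape the dataset from 'wide' to 'long' with pivot_longer (from tidyr 1.0.0), split the 'V3' column (now in long format) on ":" with separate_rows, and then, grouped by 'V1', use match against the unique values of 'V2' to turn the column names into index numbers: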
library(dplyr)
library(tidyr)
input %>%
pivot_longer(cols = -V1, names_to = "V2", values_to = "V3",
values_drop_na = TRUE) %>%
# older versions use gather
# gather(V2, V3, -V1, na.rm = TRUE) %>%
separate_rows(V3, sep=":") %>%
group_by(V1) %>%
mutate(V2 = match(V2, unique(V2))) %>%
ungroup()
# A tibble: 12 x 3
# V1 V2 V3
# <fct> <int> <chr>
# 1 a1 1 aaa-1-c
# 2 a1 1 bbb-1-d
# 3 a1 1 ccc
# 4 a1 1 a
# 5 a1 2 cc
# 6 a1 2 nnn
# 7 a1 2 ttt-cc
# 8 a1 3 aaa-1-d
# 9 a2 1 www-1-c
#10 a2 2 cdd
#11 a2 2 aaa
#12 a2 2 pp
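If you would rather not depend on tidyr, the same idea can be sketched in base R. This is only a rough alternative, not part of the approach above: split_long is a hypothetical helper name made up for illustration, and the sketch assumes the first column is the id and that the index should count only the non-NA cells in each row (which is what match does after values_drop_na = TRUE).

```r
# Same sample data as in the question, rebuilt here so the sketch runs on its own
input <- data.frame(
  V1 = c("a1", "a2"),
  V2 = c("aaa-1-c:bbb-1-d:ccc:a", "www-1-c"),
  V3 = c("cc:nnn:ttt-cc", "cdd:aaa:pp"),
  V4 = c("aaa-1-d", NA),
  stringsAsFactors = TRUE
)

# split_long is a hypothetical helper (not from tidyr or dplyr)
split_long <- function(df, id = "V1", sep = ":") {
  value_cols <- setdiff(names(df), id)
  pieces <- lapply(seq_len(nrow(df)), function(i) {
    # Collect this row's cells as character, in column order, dropping NAs
    cells <- vapply(value_cols,
                    function(cn) as.character(df[[cn]][i]),
                    character(1))
    cells <- cells[!is.na(cells)]
    if (length(cells) == 0) return(NULL)
    # Split every remaining cell on the delimiter
    parts <- strsplit(cells, sep, fixed = TRUE)
    data.frame(
      V1 = as.character(df[[id]][i]),
      # Position of the source cell among the non-NA cells of this row
      V2 = rep(seq_along(parts), lengths(parts)),
      V3 = unlist(parts, use.names = FALSE),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pieces)
}

result <- split_long(input)
print(result)
```

Unlike the tidyr version, this returns a plain data.frame with character columns for V1 and V3, and the row-by-row loop will likely be slower on a very large data frame.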