I have an oddly formatted data frame in which some of the information is stored as part of the column names.
library(tidyverse)
Ihave <- tribble(
~ID,~group,~AAA_info2_BBB,~CCC_info3_DDD,
"first", 1, as.Date("1970-01-01"), as.Date("1970-01-02"),
"second", 2, as.Date("1971-01-01"), as.Date("1971-01-02"),
"third", 3, as.Date("1972-01-01"), as.Date("1972-01-02"),
)
# A tibble: 3 x 4
ID group AAA_info2_BBB CCC_info3_DDD
<chr> <dbl> <date> <date>
1 first 1 1970-01-01 1970-01-02
2 second 2 1971-01-01 1971-01-02
3 third 3 1972-01-01 1972-01-02
I need to get that information back into the data frame itself, like this:
Iwant <- tribble(
~ID,~group,~source,~variable,~value,~period,
"first", 1, "AAA", "info2", as.Date("1970-01-01"), "BBB",
"second", 2, "AAA", "info2", as.Date("1971-01-01"), "BBB",
"third", 3, "AAA", "info2", as.Date("1972-01-01"), "BBB",
"first", 1, "CCC", "info3", as.Date("1970-01-02"), "DDD",
"second", 2, "CCC", "info3", as.Date("1971-01-02"), "DDD",
"third", 3, "CCC", "info3", as.Date("1972-01-02"), "DDD",
)
# A tibble: 6 x 6
ID group source variable value period
<chr> <dbl> <chr> <chr> <date> <chr>
1 first 1 AAA info2 1970-01-01 BBB
2 second 2 AAA info2 1971-01-01 BBB
3 third 3 AAA info2 1972-01-01 BBB
4 first 1 CCC info3 1970-01-02 DDD
5 second 2 CCC info3 1971-01-02 DDD
6 third 3 CCC info3 1972-01-02 DDD
I can get part of the way there by writing a function that handles one column of the "AAA_info2_BBB" type at a time, as follows:
my_fun <- function(df, one_var) {
  # Get the name of the column that was passed as a bare symbol
  one_var_char <- rlang::as_name(enquo(one_var))
  # Split the name on "_" into source, variable and period
  one_var_char_splitted <- strsplit(one_var_char, "_")[[1]]
  new_one_var <- one_var_char_splitted[2]
  # Rename the column to the middle part of its name, e.g. "info2"
  names(df)[names(df) == one_var_char] <- new_one_var
  # Keep only that column and attach the other two parts as constants
  df %>%
    select(all_of(new_one_var)) %>%
    data.frame(source = one_var_char_splitted[1],
               period = one_var_char_splitted[3])
}
which returns (as expected):
Ihave %>%
select(ID, group, AAA_info2_BBB) %>%
my_fun(AAA_info2_BBB)
info2 source period
1 1970-01-01 AAA BBB
2 1971-01-01 AAA BBB
3 1972-01-01 AAA BBB
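But I can't manage to "map" this function over the Ihave data frame to produce the desired Iwant. I tried several combinations with purrr::map, none of which worked. Is my approach flawed? Am I missing something?
Any help is much appreciated!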
Answer 0: (score: 3)
I wrote this before seeing @aosmith's comment, which is spot on:
library(dplyr)
library(tidyr)
Ihave %>%
gather(source, value, -ID, -group) %>%
separate(source, into = c("source", "variable", "period"), sep = "_")
# # A tibble: 6 x 6
# ID group source variable period value
# <chr> <dbl> <chr> <chr> <chr> <date>
# 1 first 1 AAA info2 BBB 1970-01-01
# 2 second 2 AAA info2 BBB 1971-01-01
# 3 third 3 AAA info2 BBB 1972-01-01
# 4 first 1 CCC info3 DDD 1970-01-02
# 5 second 2 CCC info3 DDD 1971-01-02
# 6 third 3 CCC info3 DDD 1972-01-02
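It relies on the number of fields separated by _ being constant, ordered, and known. If the format never changes, that's fine. Otherwise you will need to write something more specific or custom to handle the variations.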
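As a side note, the same reshape can probably be written in a single step with pivot_longer(), assuming tidyr 1.0.0 or later is installed. This is only a sketch: the object name long_df is made up for illustration, and Ihave is rebuilt here with tibble() so the block runs on its own.

```r
library(tibble)
library(tidyr)

# Same data as in the question, rebuilt so the example is self-contained
Ihave <- tibble(
  ID = c("first", "second", "third"),
  group = c(1, 2, 3),
  AAA_info2_BBB = as.Date(c("1970-01-01", "1971-01-01", "1972-01-01")),
  CCC_info3_DDD = as.Date(c("1970-01-02", "1971-01-02", "1972-01-02"))
)

# Split every non-identifier column name on "_" while pivoting to long form.
# long_df is a hypothetical name, not something from the question.
long_df <- pivot_longer(
  Ihave,
  cols = -c(ID, group),
  names_to = c("source", "variable", "period"),
  names_sep = "_",
  values_to = "value"
)

long_df
```

Unlike gather(), pivot_longer() orders the rows by original row first, so the two rows for "first" come out next to each other. The same caveat applies: every column name must split into exactly three fields.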
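If you are already loading library(tidyverse), there is no need to call library(dplyr) or library(tidyr) explicitly. (I include them here in case (a) someone comes along who isn't explicitly loading all 25 packages, or (b) you thought you needed all of them but want to cut load time by trimming the ones you don't use.)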
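For the purrr::map route the question asks about, one possible sketch is to map over the column names as strings rather than over bare symbols, and then stack the pieces. The helper reshape_one and the names value_cols and mapped are hypothetical, introduced only for this illustration; the sketch also assumes ID and group are the only identifier columns.

```r
library(dplyr)
library(purrr)
library(tibble)

# Same data as in the question, rebuilt so the example is self-contained
Ihave <- tibble(
  ID = c("first", "second", "third"),
  group = c(1, 2, 3),
  AAA_info2_BBB = as.Date(c("1970-01-01", "1971-01-01", "1972-01-01")),
  CCC_info3_DDD = as.Date(c("1970-01-02", "1971-01-02", "1972-01-02"))
)

# Hypothetical helper (not from the question): reshape one wide column,
# identified by its name as a string, into the long layout.
reshape_one <- function(df, col) {
  parts <- strsplit(col, "_", fixed = TRUE)[[1]]
  tibble(
    ID = df$ID,
    group = df$group,
    source = parts[1],
    variable = parts[2],
    value = df[[col]],
    period = parts[3]
  )
}

# Every column except the identifiers carries information in its name
value_cols <- setdiff(names(Ihave), c("ID", "group"))

# Apply the helper to each column name, then stack the resulting tibbles
mapped <- map(value_cols, function(col) reshape_one(Ihave, col)) %>%
  bind_rows()

mapped
```

Passing column names as strings sidesteps the quoting problem in my_fun, since map() hands over values rather than unevaluated symbols.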
Answer 1: (score: 1)
This is the same idea as gather followed by separate, but just for variety, here is a data.table approach using melt and tstrsplit:
library(data.table)
setDT(Ihave)
melt(Ihave, c('ID', 'group'))[,
c('source', 'variable', 'period') := tstrsplit(variable, '_')]
# ID group variable value source period
# 1: first 1 info2 1970-01-01 AAA BBB
# 2: second 2 info2 1971-01-01 AAA BBB
# 3: third 3 info2 1972-01-01 AAA BBB
# 4: first 1 info3 1970-01-02 CCC DDD
# 5: second 2 info3 1971-01-02 CCC DDD
# 6: third 3 info3 1972-01-02 CCC DDD
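If the column order of Iwant matters, a possible variation is to keep the melted names as character and reorder afterwards with setcolorder(). This is a sketch only: the name long_dt is made up for illustration, and the table is built directly with data.table() so that the original Ihave is not modified by reference.

```r
library(data.table)

# Same data as in the question, built as a data.table from the start
Ihave <- data.table(
  ID = c("first", "second", "third"),
  group = c(1, 2, 3),
  AAA_info2_BBB = as.Date(c("1970-01-01", "1971-01-01", "1972-01-01")),
  CCC_info3_DDD = as.Date(c("1970-01-02", "1971-01-02", "1972-01-02"))
)

# Melt to long form, keeping the old column names as character strings
long_dt <- melt(Ihave, id.vars = c("ID", "group"), variable.factor = FALSE)

# Split the old column names on "_" into three new columns
long_dt[, c("source", "variable", "period") := tstrsplit(variable, "_", fixed = TRUE)]

# Reorder the columns to match the layout of Iwant
setcolorder(long_dt, c("ID", "group", "source", "variable", "value", "period"))

long_dt[]
```

The trailing [] is there only because := returns its result invisibly, so the table would otherwise not print.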