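This is indeed a duplicate of the question r-split-string-using-tidyrseparate, but I can't use the MWE from there for my purposes because I don't know how to adapt the regular expression. I basically want the same thing, but with the variable split at the last underscore.
Reason: I have data in which several columns show the same factor/type more than once. I figured I could split the type string off the variable name and then spread the data back to a wide format with fewer columns. My problem is that my variable names contain different numbers of underscores, so I would like to learn how to split at the last underscore, which I added beforehand.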
MWE
library(tidyr)
library(data.table)
dt<-data.table(Name=c("A","B","C"),Var_1_EVU=c(2,NA,NA),Var_1_BdS=c(NA,3,4),Var_2_BdS=c(NA,3,4))
dt.long<-melt(dt, id.vars=c("Name"))
dt.long<-separate(dt.long,variable, c("test","type"), sep='/[^_]*$/')
dt.wide<-spread(dt.long,key=Name,value=value)
I would like something like this:
   Name type Var1 Var2
1: A BdS NA NA
2: A EVU 2 NA
3: B BdS 3 3
4: B EVU NA NA
5: C BdS 4 4
6: C EVU NA NA
Answer 0 (score: 3)
library(tidyr)
df <- data.frame(Name = c("A","B","C"),
                 Var_1_EVU = c(2,NA,NA),
                 Var_1_BdS = c(NA,3,4),
                 Var_2_BdS = c(NA,3,4))
df %>%
  gather("type", "value", -Name) %>%
  separate(type, into = c("type", "type_num", "var")) %>%
  unite(type, type, type_num, sep = "") %>%
  spread(type, value)
# Name var Var1 Var2
# 1 A BdS NA NA
# 2 A EVU 2 NA
# 3 B BdS 3 3
# 4 B EVU NA NA
# 5 C BdS 4 4
# 6 C EVU NA NA
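If you would rather stay with separate(), the split can be restricted to the last underscore with a lookahead in sep. This is only a minimal sketch and not part of the code above. It assumes that separate() treats a character sep as a regular expression with lookahead support, which is worth checking against your installed tidyr version. The toy data frame `long` is made up for illustration; the column names test and type follow the question.

```r
library(tidyr)

# Long-format toy data using the variable names from the question
long <- data.frame(
  Name     = c("A", "B", "C"),
  variable = c("Var_1_EVU", "Var_1_BdS", "Var_2_BdS"),
  value    = c(2, 3, 4),
  stringsAsFactors = FALSE
)

# "_(?=[^_]*$)" matches an underscore that is followed only by
# non-underscore characters up to the end of the string, i.e. the last one
out <- separate(long, variable, into = c("test", "type"), sep = "_(?=[^_]*$)")
out
```

Because the lookahead consumes nothing, only the underscore itself is removed and everything before it stays in test.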
An example using tidyr::extract to handle variable names with an arbitrary number of underscores...
library(dplyr)
library(tidyr)
df <- data.frame(Name = c("A","B","C"),
                 Var_x_1_EVU = c(2,NA,NA),
                 Var_x_1_BdS = c(NA,3,4),
                 Var_x_y_2_BdS = c(NA,3,4))
df %>%
  gather("col_name", "value", -Name) %>%
  extract(col_name, c("var", "type"), "(.*)_(.*)") %>%
  spread(var, value)
# Name type Var_x_1 Var_x_y_2
# 1 A BdS NA NA
# 2 A EVU 2 NA
# 3 B BdS 3 3
# 4 B EVU NA NA
# 5 C BdS 4 4
# 6 C EVU NA NA
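To see why "(.*)_(.*)" splits at the last underscore, here is a small base R sketch that applies the same pattern with sub(). The first group is greedy, so it takes as much as it can and leaves only the text after the final underscore for the second group. The vector `col_names` and the anchored variant of the pattern are just for illustration.

```r
# Column names from the example above
col_names <- c("Var_x_1_EVU", "Var_x_1_BdS", "Var_x_y_2_BdS")

# Same pattern as in extract(), anchored so sub() replaces the whole string
pattern <- "^(.*)_(.*)$"

# The greedy first group runs up to the last underscore
var  <- sub(pattern, "\\1", col_names, perl = TRUE)
# The second group is whatever follows the last underscore
type <- sub(pattern, "\\2", col_names, perl = TRUE)

data.frame(col_name = col_names, var = var, type = type)
```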
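You can avoid potential problems with duplicate observations by first adding a row-number column/variable with mutate(n = row_number()), so that every observation is unique, and you can avoid tidyr::extract being masked by magrittr by calling it explicitly as tidyr::extract...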
library(dplyr)
library(tidyr)
library(data.table)
library(magrittr)
dt <- data.table(Name = c("A", "A", "B", "C"),
                 Var_1_EVU = c(1, 2, NA, NA),
                 Var_1_BdS = c(1, NA, 3, 4),
                 Var_x_2_BdS = c(1, NA, 3, 4))
dt %>%
  mutate(n = row_number()) %>%
  gather("col_name", "value", -n, -Name) %>%
  tidyr::extract(col_name, c("var", "type"), "(.*)_(.*)") %>%
  spread(var, value)
# Name n type Var_1 Var_x_2
# 1 A 1 BdS 1 1
# 2 A 1 EVU 1 NA
# 3 A 2 BdS NA NA
# 4 A 2 EVU 2 NA
# 5 B 3 BdS 3 3
# 6 B 3 EVU NA NA
# 7 C 4 BdS 4 4
# 8 C 4 EVU NA NA
Answer 1 (score: 2)
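Here is an alternative data.table solution using tstrsplit / melt / dcast.
I would personally stick with data.table in this case, because spread doesn't have a fun argument, so you will get an error if there are duplicates when spreading back.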
library(data.table)
library(magrittr) # people like pipes these days
dt %>%
  # convert to long format like you did
  melt(., id = "Name") %>%
  # split by the last underscore
  .[, c("variable", "grp") := tstrsplit(variable, "_(?!.*_)", perl = TRUE)] %>%
  # convert back to wide format
  dcast(., Name + grp ~ variable)
# Name grp Var_1 Var_2
# 1: A BdS NA NA
# 2: A EVU 2 NA
# 3: B BdS 3 3
# 4: B EVU NA NA
# 5: C BdS 4 4
# 6: C EVU NA NA
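The pattern "_(?!.*_)" uses a negative lookahead: it matches an underscore only if no further underscore follows it, which makes it the last one. A minimal base R sketch of the same split with strsplit(), using a made-up vector `x`, shows the pieces that tstrsplit then turns into columns:

```r
# Hypothetical variable names, including one without any underscore
x <- c("Var_1_EVU", "Var_x_y_2_BdS", "NoUnderscore")

# perl = TRUE is required for the lookahead syntax (?!...)
parts <- strsplit(x, "_(?!.*_)", perl = TRUE)

# Each element holds the text before and after the last underscore;
# a name without an underscore comes back as a single piece
parts
```

Names without an underscore yield one piece instead of two, so it is worth checking the variable names before assigning the result to two columns.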