My question is similar to conditional string splitting in R (using tidyr). However, I need to split into more than two columns. If the dataset column is
cost
reed_cost
cost of living
reed cost
id gene_id locus
how can I split it into four columns, like this:
col1 col2 col3   col4
                 cost
          reed   cost
     cost   of living
          reed   cost
  id gene   id  locus
I tried the solution from the linked question, but could not get it to work correctly.
Answer 0 (score: 1)
dat <- data.frame(V1 = c("cost", "reed_cost", "cost of living", "reed cost", "id gene_id locus")) # Your data
library(stringr)
vars <- str_split_fixed(dat$V1, " |_", max(str_count(dat$V1, " |_") + 1))
dat2 <- data.frame(t(apply(vars, 1, function(x) c(x[x == ""], x[x != ""]))))
names(dat2) <- paste0("col", seq_len(dim(dat2)[2]))
#   col1 col2 col3   col4
# 1                  cost
# 2           reed   cost
# 3      cost   of living
# 4           reed   cost
# 5   id gene   id  locus
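The key step in the code above is the function passed to `apply`: for each row of the split matrix it puts the empty strings before the non-empty ones, which pushes the words to the right. A minimal sketch of that step on a single row (the vector `row` is made up for illustration):

```r
# One row as str_split_fixed() would return it: words first, then "" padding.
row <- c("reed", "cost", "", "")

# Empty strings first, then the words in their original order.
shifted <- c(row[row == ""], row[row != ""])
print(shifted)
```

Because the words are selected with a logical index, their relative order is preserved; only the padding moves.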
Answer 1 (score: 1)
The following two options should scale well. You need "data.table" and "reshape2" loaded, along with my cSplit function.
library(data.table)
library(reshape2)
library(devtools)
source_gist(11380733) ## For cSplit
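The first assumes that you don't actually need the values floated to the rightmost columns. (In the code below, X appears to be a data frame holding the strings in a column named x.)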
cSplit(X, "x", sep = " |_", fixed = FALSE)
#     x_1  x_2    x_3   x_4
# 1: cost   NA     NA    NA
# 2: reed cost     NA    NA
# 3: cost   of living    NA
# 4: reed cost     NA    NA
# 5:   id gene     id locus
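For readers who cannot source the gist, the same left-aligned layout can be approximated in base R. This is only a sketch, not the cSplit implementation; the names `parts`, `width`, and `left` are made up here, and the column names merely imitate the `x_1 ... x_4` labels above.

```r
x <- c("cost", "reed_cost", "cost of living", "reed cost", "id gene_id locus")

# Split each string on a space or an underscore.
parts <- strsplit(x, "[ _]")

# Extend every piece vector to the longest length; the new slots become NA.
width <- max(lengths(parts))
left <- t(sapply(parts, function(p) { length(p) <- width; p }))
colnames(left) <- paste0("x_", seq_len(width))
print(left)
```

The result is a character matrix rather than a data.table, with the words on the left and `NA` padding on the right.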
The second assumes that you want the data in the form shown in your table:
dcast.data.table( # for long to wide
cSplit(cbind(rn = 1:nrow(X), X), # start by splitting into a long form
"x", sep = " |_", "long",
fixed = FALSE)[,
n := sequence(.N), by = rn][, # sequence by row-name
n := abs(n-max(n))+1], # ^^ reversed
rn ~ n, value.var = "x", fill = "") # formula for casting
#    rn     1      2    3    4
# 1:  1                   cost
# 2:  2              cost reed
# 3:  3       living   of cost
# 4:  4              cost reed
# 5:  5 locus     id gene   id
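Answer 2 (score: 0)
Here is a base R solution. We split the input and reverse the elements of each row. Then we set the length of each row to the maximum length, which pads the end with NA, and reverse each row again so that the padding ends up on the left: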
# test data
x <- c("cost", "reed_cost", "cost of living", "reed cost", "id gene_id locus")
s <- lapply(strsplit(x, "[ _]"), rev)
t(sapply(lapply(s, "length<-", max(sapply(s, length))), rev))
This gives the following matrix:
     [,1] [,2]   [,3]   [,4]
[1,] NA   NA     NA     "cost"
[2,] NA   NA     "reed" "cost"
[3,] NA   "cost" "of"   "living"
[4,] NA   NA     "reed" "cost"
[5,] "id" "gene" "id"   "locus"
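If a data frame with named columns and blank padding is preferred over a matrix with NA, the same idea can be wrapped in a small function. The helper below is hypothetical (`split_right` and its arguments are not part of the answer); it pads on the left directly instead of reversing twice.

```r
# Hypothetical helper: split on a regex and right-align the pieces,
# returning a data frame with columns col1, col2, ...
split_right <- function(x, pattern = "[ _]", fill = "") {
  parts <- strsplit(x, pattern)
  width <- max(lengths(parts))
  # Prepend enough copies of `fill` to bring each row up to `width`.
  padded <- lapply(parts, function(p) c(rep(fill, width - length(p)), p))
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(out) <- paste0("col", seq_len(width))
  out
}

x <- c("cost", "reed_cost", "cost of living", "reed cost", "id gene_id locus")
res <- split_right(x)
print(res)
```

Passing `fill = NA` instead should reproduce the NA padding of the matrix above.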