I actually have the same problem as in this case, "strsplit one column with exact information into two column".
That question has already been solved, except that my data looks like this:
SNP Geno AlleleA AlleleB AlleleC AlleleD AlleleE
1 marker1 G1 AA AA AA AA AA
2 marker2 G1 TT TT TT TT TT
3 marker3 G1 TT TT TT TT TT
4 marker1 G2 CC CC CC CC CC
5 marker2 G2 AA AA AA AA AA
6 marker3 G2 TT TT TT TT TT
7 marker1 G3 GG GG GG GG GG
8 marker2 G3 AA AA AA AA AA
9 marker3 G3 TT TT TT TT TT
The dput output:
structure(list(SNP = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L,
2L, 3L), .Label = c("marker1", "marker2", "marker3"), class = "factor"),
Geno = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("G1",
"G2", "G3"), class = "factor"), AlleleA = structure(c(1L,
4L, 4L, 2L, 1L, 4L, 3L, 1L, 4L), .Label = c("AA", "CC", "GG",
"TT"), class = "factor"), AlleleB = structure(c(1L, 4L, 4L,
2L, 1L, 4L, 3L, 1L, 4L), class = "factor", .Label = c("AA",
"CC", "GG", "TT")), AlleleC = structure(c(1L, 4L, 4L, 2L,
1L, 4L, 3L, 1L, 4L), class = "factor", .Label = c("AA", "CC",
"GG", "TT")), AlleleD = structure(c(1L, 4L, 4L, 2L, 1L, 4L,
3L, 1L, 4L), class = "factor", .Label = c("AA", "CC", "GG",
"TT")), AlleleE = structure(c(1L, 4L, 4L, 2L, 1L, 4L, 3L,
1L, 4L), class = "factor", .Label = c("AA", "CC", "GG", "TT"
))), .Names = c("SNP", "Geno", "AlleleA", "AlleleB", "AlleleC",
"AlleleD", "AlleleE"), row.names = c(NA, -9L), class = "data.frame")
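In that question, there was only one column to split into two. My problem is that I have 5000 columns (AlleleA, AlleleB, ... etc.) that I want to split (each column into two).
I tried using a loop like this, but it does not work: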
for(i in colnames(dat)){
dat1 <- data.frame(do.call(rbind, strsplit(as.vector(sprintf("dat$%s",i)), split = "")))
}
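I look forward to your insight, thanks.
Answer 0: (score: 4)
You can use cSplit from my "splitstackshape" package with the stripWhite = FALSE argument.
For example, if we wanted to split all of the "Allele*" columns, we would do: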
library(splitstackshape)
cSplit(mydf, grep("Allele", names(mydf)), "", stripWhite = FALSE)
# SNP Geno AlleleA_1 AlleleA_2 AlleleB_1 AlleleB_2 AlleleC_1
# 1: marker1 G1 A A A A A
# 2: marker2 G1 T T T T T
# 3: marker3 G1 T T T T T
# 4: marker1 G2 C C C C C
# 5: marker2 G2 A A A A A
# 6: marker3 G2 T T T T T
# 7: marker1 G3 G G G G G
# 8: marker2 G3 A A A A A
# 9: marker3 G3 T T T T T
# AlleleC_2 AlleleD_1 AlleleD_2 AlleleE_1 AlleleE_2
# 1: A A A A A
# 2: T T T T T
# 3: T T T T T
# 4: C C C C C
# 5: A A A A A
# 6: T T T T T
# 7: G G G G G
# 8: A A A A A
# 9: T T T T T
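For comparison, here is a minimal base R sketch of the same split, not the cSplit approach above. It assumes every Allele value is exactly two characters long; the names allele_cols, split_list and res are made up for illustration. It also shows why the loop in the question fails: sprintf("dat$%s", i) only builds the text "dat$AlleleA" rather than fetching the column, and dat1 is overwritten on every iteration. Indexing with dat[[col]] fetches the actual column.

```r
# Small sample mirroring the first rows of the question's data
dat <- data.frame(
  SNP = c("marker1", "marker2", "marker3"),
  Geno = c("G1", "G1", "G1"),
  AlleleA = c("AA", "TT", "TT"),
  AlleleB = c("AA", "TT", "TT"),
  stringsAsFactors = FALSE
)

# Names of the columns to split (hypothetical helper variable)
allele_cols <- grep("^Allele", names(dat), value = TRUE)

# For each column, take the first and the second character.
# Assumption: every value has exactly two characters.
split_list <- lapply(allele_cols, function(col) {
  x <- as.character(dat[[col]])  # as.character() also handles factor columns
  out <- data.frame(substr(x, 1, 1), substr(x, 2, 2),
                    stringsAsFactors = FALSE)
  names(out) <- paste(col, 1:2, sep = "_")
  out
})

# Keep the untouched columns and bind the split ones next to them
res <- cbind(dat[setdiff(names(dat), allele_cols)],
             do.call(cbind, split_list))
```

Because the column list comes from grep(), the same code should apply unchanged to a data frame with thousands of Allele columns, although I have not timed it at that size.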
Answer 1: (score: 3)
Another option is:
library(qdap)
res <- colsplit2df(dat, splitcols=2:ncol(dat),sep='')
colnames(res)[-1] <- make.names(rep(colnames(dat)[-1],each=2), unique=TRUE)
res[1:3,1:5]
# SNP Geno Geno.1 AlleleA AlleleA.1
#1 marker1 G 1 A A
#2 marker2 G 1 T T
#3 marker3 G 1 T T
Or, for the Allele columns only:
colsplit2df(dat, splitcols=grep('Allele', names(dat)),sep='')
Edit (Tyler Rinker)
I would suggest first editing the column names of the data.frame with setNames, like so:
setNames(dat, gsub("([A-Z]{1}[a-z]+[A-Z])", "\\1.1&\\1.2", names(dat))) %>%
colsplit2df(splitcols=3:ncol(dat), sep='')
Answer 2: (score: 2)
As @beginneR said, you can use tidyr::separate. Here is an example from http://blog.rstudio.org/2014/07/22/introducing-tidyr/:
head(tidier, 8)
#> id trt key time
#> 1 1 treatment work.T1 0.08514
#> 2 2 control work.T1 0.22544
#> 3 3 treatment work.T1 0.27453
#> 4 4 control work.T1 0.27231
#> 5 1 treatment home.T1 0.61583
#> 6 2 control home.T1 0.42967
#> 7 3 treatment home.T1 0.65166
#> 8 4 control home.T1 0.56774
tidy <- tidier %>%
separate(key, into = c("location", "time"), sep = "\\.")
tidy %>% head(8)
#> id trt location time time
#> 1 1 treatment work T1 0.08514
#> 2 2 control work T1 0.22544
#> 3 3 treatment work T1 0.27453
#> 4 4 control work T1 0.27231
#> 5 1 treatment home T1 0.61583
#> 6 2 control home T1 0.42967
#> 7 3 treatment home T1 0.65166
#> 8 4 control home T1 0.56774
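Applied to the data in the question, a rough sketch could look like the following. The blog example splits on a "." separator, whereas the allele values have no separator, so this assumes the numeric form of sep, which splits at a character position. The new column names AlleleA_1 and AlleleA_2 are my own choice, and only a single column is shown.

```r
library(tidyr)

# Small sample mirroring the first rows of the question's data
dat <- data.frame(
  SNP = c("marker1", "marker2", "marker3"),
  Geno = c("G1", "G1", "G1"),
  AlleleA = c("AA", "TT", "TT"),
  stringsAsFactors = FALSE
)

# sep = 1 splits after the first character instead of matching a regex
res <- separate(dat, AlleleA,
                into = c("AlleleA_1", "AlleleA_2"),
                sep = 1)
```

separate works on one column per call, so for thousands of columns it would have to be repeated for each column name.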