I have a data frame that contains one long string per row, each associated with a 'Sample':
Sample Data
1 000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N
2 000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N
I would like a simple way to split each string into 5 parts, laid out as follows:
Sample X
CCT6 - Characters 1-33
GAT1 - Characters 34-68
IMD3 - Characters 69-99
PDR3 - Characters 100-130
RIM15 - Characters 131-168
so that each sample produces output like this:
Sample 1
CCT6 - 000000000000000000000000000N01000
GAT1 - 000000000N0N000000000N00N0000NN00N0
IMD3 - N000000100000N00N0N0000000NNNN0
PDR3 - 1111111111111111111111111111111
RIM15 - 0000000000000000000N000000N0000000000N
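I have been able to split the long string into individual pieces with the substr function, but I'd like to automate this so that I get all 5 pieces in a single output. Ideally, that output would also be a data frame.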
Answer 0 (score: 5)
This is what read.fwf is for (see ?read.fwf).
First, some data resembling those in your question:
x <- data.frame(Sample = c(1, 2), Data = c("000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N",
"000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N"),
stringsAsFactors = FALSE)
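Now use read.fwf, specifying the width of each field and its name, and that all fields should be of mode character. We wrap the text column of the example data in textConnection so that it can be treated as a connection, which read.* and other functions generally understand.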
(strs <- read.fwf(textConnection(x$Data), widths = c(33, 35, 31, 31, 38), colClasses = "character", col.names = c("CCT6", "GAT1", "IMD3", "PDR3", "RIM15")))
CCT6 GAT1 IMD3 PDR3 RIM15
1 000000000000000000000000000N01000 000000000N0N000000000N00N0000NN00N0 N000000100000N00N0N0000000NNNN0 1111111111111111111111111111111 0000000000000000000N000000N0000000000N
2 000000000000000000000000000N01000 000000000N0N000000000N00N0000NN00N0 N000000100000N00N0N0000000NNNN0 1111111111111111111111111111111 0000000000000000000N000000N0000000000N
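The widths passed to read.fwf follow directly from the character ranges in the question: each range is inclusive, so a field's width is its end minus its start plus one. A small sketch of that arithmetic (starts, ends and widths are only illustrative names, not part of the answer's code):

```r
# Start and end positions copied from the question's list of character ranges
starts <- c(CCT6 = 1, GAT1 = 34, IMD3 = 69, PDR3 = 100, RIM15 = 131)
ends   <- c(CCT6 = 33, GAT1 = 68, IMD3 = 99, PDR3 = 130, RIM15 = 168)

# Ranges are inclusive, so width = end - start + 1
widths <- ends - starts + 1
print(widths)  # 33 35 31 31 38

# Sanity check: the fields tile the string with no gaps or overlaps,
# i.e. every start after the first directly follows the previous end
stopifnot(all(starts[-1] == ends[-length(ends)] + 1))
```

Deriving the widths this way makes it easier to spot an off-by-one error than typing them by hand.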
Now loop over the rows and print each one in the format from your example:
for (i in seq_len(nrow(strs))) {
    writeLines(paste("Sample", i))
    writeLines(paste(names(strs), unlist(strs[i, ]), sep = " - "))
}
which gives, for example:
Sample 2
CCT6 - 000000000000000000000000000N01000
GAT1 - 000000000N0N000000000N00N0000NN00N0
IMD3 - N000000100000N00N0N0000000NNNN0
PDR3 - 1111111111111111111111111111111
RIM15 - 0000000000000000000N000000N0000000000N
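The question also asked for the result to be a data frame. The strs object returned by read.fwf already is one; as a sketch, the sample identifiers can be put back next to the split fields with data.frame. The name wide is only illustrative, and the example data are rebuilt here so the snippet runs on its own:

```r
# Example data as in the question (both samples carry the same string)
s <- "000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N"
x <- data.frame(Sample = c(1, 2), Data = c(s, s), stringsAsFactors = FALSE)

# Split the strings into fixed-width fields, keeping every field as character
con <- textConnection(x$Data)
strs <- read.fwf(con, widths = c(33, 35, 31, 31, 38), colClasses = "character",
                 col.names = c("CCT6", "GAT1", "IMD3", "PDR3", "RIM15"))
close(con)  # a text connection is created open, so close it explicitly

# One row per sample: the identifier followed by the five fields
wide <- data.frame(Sample = x$Sample, strs, stringsAsFactors = FALSE)
str(wide)
```

Using x$Sample rather than the row index keeps the output correct if the sample identifiers are not simply 1, 2, 3, and so on.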
Answer 1 (score: 1)
SampX <- textConnection("CCT6 - Characters 1-33
GAT1 - Characters 34-68
IMD3 - Characters 69-99
PDR3 - Characters 100-130
RIM15 - Characters 131-168")
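# Splitting on "-" gives V1 = gene name, V2 = "Characters <start>", V3 = <end>
dfSampX <- read.table(SampX, sep = "-")
# Drop the "Characters " label so that V4 holds the numeric start position
dfSampX$V4 <- as.numeric(sub("Characters ", "", dfSampX$V2))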
sampdat <- read.table(textConnection("Sample Data
1 000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N
2 000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N
"), header=TRUE,stringsAsFactors=FALSE)
This code splits the strings into their pieces (one row per sample, one column per gene):
apply(dfSampX[,c(3,4)], 1, function(x) substr(sampdat[,2], x["V4"], x["V3"]) )
[,1] [,2]
[1,] "000000000000000000000000000N01000" "000000000N0N000000000N00N0000NN00N0"
[2,] "000000000000000000000000000N01000" "000000000N0N000000000N00N0000NN00N0"
[,3] [,4]
[1,] "N000000100000N00N0N0000000NNNN0" "1111111111111111111111111111111"
[2,] "N000000100000N00N0N0000000NNNN0" "1111111111111111111111111111111"
[,5]
[1,] "0000000000000000000N000000N0000000000N"
[2,] "0000000000000000000N000000N0000000000N"
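The matrix above has one row per sample and one column per gene, but no column names. As a sketch of the same split without parsing the range table, base R's substring is vectorized over its first and last arguments, so one call per string returns all five pieces. The positions are typed in by hand here, and genes, starts, ends, mat and pieces are illustrative names rather than part of the answer's code:

```r
# Example data as in the question (both samples carry the same string)
s <- "000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N"
sampdat <- data.frame(Sample = 1:2, Data = c(s, s), stringsAsFactors = FALSE)

genes  <- c("CCT6", "GAT1", "IMD3", "PDR3", "RIM15")
starts <- c(1, 34, 69, 100, 131)
ends   <- c(33, 68, 99, 130, 168)

# vapply returns a genes-by-samples matrix; transpose to samples-by-genes
mat <- t(vapply(sampdat$Data, substring, character(length(genes)),
                first = starts, last = ends, USE.NAMES = FALSE))
colnames(mat) <- genes

# Attach the sample identifiers to obtain a data frame
pieces <- data.frame(Sample = sampdat$Sample, mat, stringsAsFactors = FALSE)
str(pieces)
```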
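This code returns the pieces as a list, with one named character vector per sample:
res <- lapply(sampdat$Data, function(x)
    apply(dfSampX[, c(3, 4)], 1, function(y) substr(x, y["V4"], y["V3"])))
# Label each piece with its gene name
res2 <- lapply(res, function(x) { names(x) <- dfSampX$V1; x })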
res2
[[1]]
CCT6 GAT1
"000000000000000000000000000N01000" "000000000N0N000000000N00N0000NN00N0"
IMD3 PDR3
"N000000100000N00N0N0000000NNNN0" "1111111111111111111111111111111"
RIM15
"0000000000000000000N000000N0000000000N"
[[2]]
CCT6 GAT1
"000000000000000000000000000N01000" "000000000N0N000000000N00N0000NN00N0"
IMD3 PDR3
"N000000100000N00N0N0000000NNNN0" "1111111111111111111111111111111"
RIM15
"0000000000000000000N000000N0000000000N"
And to get the requested output format:
for (samp in seq_along(res2)) {
    cat("Sample ", samp, "\n")
    invisible(sapply(1:5, function(y)
        cat(as.character(dfSampX$V1[y]), " - ", res2[[samp]][y], "\n")))
}
Sample 1
CCT6 - 000000000000000000000000000N01000
GAT1 - 000000000N0N000000000N00N0000NN00N0
IMD3 - N000000100000N00N0N0000000NNNN0
PDR3 - 1111111111111111111111111111111
RIM15 - 0000000000000000000N000000N0000000000N
Sample 2
CCT6 - 000000000000000000000000000N01000
GAT1 - 000000000N0N000000000N00N0000NN00N0
IMD3 - N000000100000N00N0N0000000NNNN0
PDR3 - 1111111111111111111111111111111
RIM15 - 0000000000000000000N000000N0000000000N
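invisible suppresses the list of NULL values that sapply collects from the cat calls. These would be printed if the sapply call were run on its own at top level; inside a for loop nothing is auto-printed, so here it acts as a safeguard rather than a strict requirement.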
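As a sketch of an alternative that leaves no return value to suppress, the report lines can be built as a character vector first and printed in one go with writeLines. Here format_sample is a hypothetical helper, the positions are typed in by hand, and both samples reuse the same pieces because the example strings are identical:

```r
# Example string as in the question
s <- "000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N"

genes  <- c("CCT6", "GAT1", "IMD3", "PDR3", "RIM15")
starts <- c(1, 34, 69, 100, 131)
ends   <- c(33, 68, 99, 130, 168)

# Named character vector holding the five pieces of one string
pieces <- setNames(substring(s, starts, ends), genes)

# Hypothetical helper: the report lines for one sample as a character vector
format_sample <- function(id, pieces) {
  c(paste("Sample", id), paste(names(pieces), pieces, sep = " - "))
}

# Build the lines for samples 1 and 2, then print them all at once
report <- unlist(lapply(1:2, format_sample, pieces = pieces))
writeLines(report)
```

Keeping the formatting separate from the printing also makes it straightforward to write the same lines to a file instead, by passing a file connection to writeLines.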