Using a piece of software, I can compute fingerprints like this:
> L
[1] "1 1:1 2:1 3:1 5:1 6:1 8:1"
[2] "5 1:1 2:1 4:1"
[3] "9 1:1 2:1 7:1 10:1"
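The first values (1, 5, 9) are the molecule names, and the rest of each line is the corresponding fingerprint. Fingerprints have a fixed length, say 10. In each pair, the number to the left of ":" is the position and the number to the right is the bit: 1 means the bit is set, and positions whose bit is 0 are omitted. I want to restore the original format, that is, 10 bits with a value at every position.
L should then look like this, so that I can save it in CSV format: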
mol 1 2 3 4 5 6 7 8 9 10
1 1 1 1 0 1 1 0 1 0 0
5 1 1 0 1 0 0 0 0 0 0
9 1 1 0 0 0 0 1 0 0 1
Here L has millions of rows. What is an efficient way to convert it to the required format?
Thanks.
Answer 0 (score: 2)
To avoid read.csv, use strsplit together with the unexported splitstackshape:::numMat function:
M <- strsplit(L, "\\s+|:")
cbind(mol = as.numeric(sapply(M, `[`, 1)),
      splitstackshape:::numMat(lapply(M, `[`, -1), fill = 0))
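Because numMat is unexported, it may change between package versions. As a rough alternative, here is a base R sketch that fills the matrix with one vectorised (row, column) assignment. The names n_bits, row_idx, pos, bit and fp are made up for this illustration, and the fingerprint length is assumed to be known in advance.

```r
# Sample input from the question.
L <- c("1 1:1 2:1 3:1 5:1 6:1 8:1",
       "5 1:1 2:1 4:1",
       "9 1:1 2:1 7:1 10:1")

n_bits <- 10  # assumed fixed fingerprint length (hypothetical name)

# Split each line on whitespace: the first token is the molecule name,
# the remaining tokens are "position:bit" pairs.
tokens <- strsplit(L, "\\s+")
mol    <- as.numeric(vapply(tokens, `[`, character(1), 1))
pairs  <- lapply(tokens, `[`, -1)

# One row index per pair, plus the position and bit parsed from each pair.
row_idx <- rep(seq_along(pairs), lengths(pairs))
flat    <- unlist(pairs)
pos     <- as.integer(sub(":.*$", "", flat))
bit     <- as.numeric(sub("^.*:", "", flat))

# Start from all zeros and set every listed position in a single assignment.
fp <- matrix(0, nrow = length(L), ncol = n_bits,
             dimnames = list(NULL, as.character(seq_len(n_bits))))
fp[cbind(row_idx, pos)] <- bit

result <- cbind(mol = mol, fp)
```

The result can then be written out with something like write.csv(result, "fingerprints.csv", row.names = FALSE), where the file name is only a placeholder.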
Out of curiosity, here are some timings.
Sample data:
L <- c("1 1:1 2:1 3:1 5:1 6:1 8:1",
"5 1:1 2:1 4:1",
"9 1:1 2:1 7:1 10:1")
M <- replicate(10000, L)
@thelatemail's answer:
fun1 <- function() {
  spl <- lapply(strsplit(M, "\\s+|:.? |:.$"), as.numeric)
  vals <- lapply(spl, "[", -1)
  data.frame(
    mol = sapply(spl, "[", 1),
    t(sapply(vals, function(x) {
      out <- rep(0, max(unlist(vals)))
      out[x] <- 1
      out
    }))
  )
}
system.time(out_late <- fun1())
# user system elapsed
# 98.36 1.28 100.06
head(out_late)
# mol X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# 1 1 1 1 1 0 1 1 0 1 0 0
# 2 5 1 1 0 1 0 0 0 0 0 0
# 3 9 1 1 0 0 0 0 1 0 0 1
# 4 1 1 1 1 0 1 1 0 1 0 0
# 5 5 1 1 0 1 0 0 0 0 0 0
# 6 9 1 1 0 0 0 0 1 0 0 1
My latest answer:
library(splitstackshape)
fun2 <- function() {
  M <- strsplit(M, "\\s+|:")
  cbind(mol = as.numeric(sapply(M, `[`, 1)),
        splitstackshape:::numMat(lapply(M, `[`, -1), fill = 0))
}
system.time(out_ananda <- fun2())
# user system elapsed
# 0.67 0.00 0.68
head(out_ananda)
# mol 1 2 3 4 5 6 7 8 9 10
# [1,] 1 1 1 1 0 1 1 0 1 0 0
# [2,] 5 1 1 0 1 0 0 0 0 0 0
# [3,] 9 1 1 0 0 0 0 1 0 0 1
# [4,] 1 1 1 1 0 1 1 0 1 0 0
# [5,] 5 1 1 0 1 0 0 0 0 0 0
# [6,] 9 1 1 0 0 0 0 1 0 0 1
@Matthew's answer. Note that this would need to be modified to accept different "val" values.
fun3 <- function() {
  t(sapply(strsplit(M, "\\s+"), function(l) {
    mol <- as.numeric(l[1])
    names(mol) <- 'mol'
    val <- numeric(10)
    names(val) <- 1:10
    for (x in strsplit(l[-1], ":"))
      val[x[1]] <- as.numeric(x[2])
    c(mol, val)
  }))
}
system.time(out_matthew <- fun3())
# user system elapsed
# 2.33 0.00 2.34
head(out_matthew)
# mol 1 2 3 4 5 6 7 8 9 10
# [1,] 1 1 1 1 0 1 1 0 1 0 0
# [2,] 5 1 1 0 1 0 0 0 0 0 0
# [3,] 9 1 1 0 0 0 0 1 0 0 1
# [4,] 1 1 1 1 0 1 1 0 1 0 0
# [5,] 5 1 1 0 1 0 0 0 0 0 0
# [6,] 9 1 1 0 0 0 0 1 0 0 1
Answer 1 (score: 2)
An attempt using base R functions, assuming L is the same as the one used by @Ananda.
spl <- lapply(strsplit(L, "\\s+|:.? |:.$"), as.numeric)
vals <- lapply(spl, "[", -1)
data.frame(
  mol = sapply(spl, "[", 1),
  t(sapply(vals, function(x) {
    out <- rep(0, max(unlist(vals)))
    out[x] <- 1
    out
  }))
)
# mol X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
#1 1 1 1 1 0 1 1 0 1 0 0
#2 5 1 1 0 1 0 0 0 0 0 0
#3 9 1 1 0 0 0 0 1 0 0 1
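The code above calls max(unlist(vals)) once per row, which probably accounts for much of its running time on large inputs. A minimal sketch of the same idea with that value computed once follows; n_bits, fp and out_df are names introduced here for illustration, and no timing is claimed for it.

```r
# Sample input from the question.
L <- c("1 1:1 2:1 3:1 5:1 6:1 8:1",
       "5 1:1 2:1 4:1",
       "9 1:1 2:1 7:1 10:1")

# Same regex as in the answer: it drops the ":bit" part of each pair,
# leaving the molecule name followed by the set positions.
spl  <- lapply(strsplit(L, "\\s+|:.? |:.$"), as.numeric)
vals <- lapply(spl, `[`, -1)

# Computed once here instead of once per row.
# Assumption: the largest position seen equals the fingerprint length.
n_bits <- max(unlist(vals))

# One row of 0/1 values per molecule.
fp <- t(vapply(vals, function(x) {
  out <- numeric(n_bits)
  out[x] <- 1
  out
}, numeric(n_bits)))

out_df <- data.frame(mol = vapply(spl, `[`, numeric(1), 1), fp)
```

Inferring n_bits from the data would undercount if no molecule sets the last bit, so passing the known length (10 here) may be safer.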
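Answer 2 (score: 2)
Borrowing from thelatemail, here is an expression that returns a matrix of the appropriate elements. Rather than setting the values to 1, the for loop sets each one to the value that follows the ":" character. The whole thing is then transposed into the format you want.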
t(sapply(strsplit(L, "\\s+"), function(l) {
  # Each line is passed in as a vector, the first element is "mol"
  mol <- as.numeric(l[1])
  names(mol) <- 'mol'
  # Store the values in a vector of length 10, with names
  val <- numeric(10)
  names(val) <- 1:10
  # Split the tail of the input vector on ":" and assign to the proper slot of the output vector
  for (x in strsplit(l[-1], ":"))
    val[x[1]] <- as.numeric(x[2])
  # Put them back together
  c(mol, val)
}))
## mol 1 2 3 4 5 6 7 8 9 10
## [1,] 1 1 1 1 0 1 1 0 1 0 0
## [2,] 5 1 1 0 1 0 0 0 0 0 0
## [3,] 9 1 1 0 0 0 0 1 0 0 1
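The expression above hard-codes a fingerprint length of 10. One possible way to generalise it is sketched below, wrapping the same loop in a function; expand_fingerprints and n_bits are hypothetical names, not part of the original answer.

```r
# Sample input from the question.
L <- c("1 1:1 2:1 3:1 5:1 6:1 8:1",
       "5 1:1 2:1 4:1",
       "9 1:1 2:1 7:1 10:1")

# Hypothetical helper: same logic as above, with the length as a parameter.
expand_fingerprints <- function(lines, n_bits) {
  t(vapply(strsplit(lines, "\\s+"), function(l) {
    # Named vector of zeros, one slot per bit position.
    val <- numeric(n_bits)
    names(val) <- seq_len(n_bits)
    # Each remaining token is "position:bit"; assign the bit by position name.
    for (x in strsplit(l[-1], ":"))
      val[x[1]] <- as.numeric(x[2])
    c(mol = as.numeric(l[1]), val)
  }, numeric(n_bits + 1)))
}

res <- expand_fingerprints(L, n_bits = 10)
```

Using vapply with a declared result length has a side benefit: a position larger than n_bits makes the returned vector too long, so the call should stop with an error instead of silently producing a ragged result.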