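I am trying to write a simple R function that samples 5-element substrings from the two columns of a single data frame. Within each row the two strings have the same length, but the lengths differ from row to row. The function works when I specify a single row and column, but I cannot get it to run over every row and column with an apply statement. As written, it only draws the random sample based on the length of the first instance, so if the first instance is shorter than any of the other strings, the output for the other rows is sometimes fewer than 5 elements.
Example df: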
BP TF
1 CGTCTCTATTCTAGGCAAGA TTTFFFFTFFFTFFFTFTTT
2 AAGTCACTCGAATTCGGATGCCCCCTAGGC TTFFFFFTFFFFTTFTFFTTTFTTTTFTFF
3 TGCTCATGACGGGAC FFFTFTFFFFTFTFT
Expected output:
1 CTATT FFTFF
2 CCTAG TTTFT
3 TCATG TFTFF
Reproducible example code:
#make fake data frame
BaseP1 <- paste(sample(size = 20, x = c("A","C","T","G"), replace = TRUE), collapse = "")
BaseP2 <- paste(sample(size = 30, x = c("A","C","T","G"), replace = TRUE), collapse = "")
BaseP3 <- paste(sample(size = 15, x = c("A","C","T","G"), replace = TRUE), collapse = "")
TrueFalse1 <- paste(sample(size = 20, x = c("T","F"), replace = TRUE), collapse = "")
TrueFalse2 <- paste(sample(size = 30, x = c("T","F"), replace = TRUE), collapse = "")
TrueFalse3 <- paste(sample(size = 15, x = c("T","F"), replace = TRUE), collapse = "")
my_df <- data.frame(c(BaseP1,BaseP2,BaseP3), c(TrueFalse1, TrueFalse2, TrueFalse3))
Fragment = function(string) {
nStart = sample(1:nchar(string) -5, 1)
substr(string, nStart, nStart + 4)
}
Fragment(string = my_df[1,1])#works for the first row, first col.
apply(my_df, c(1,2), function(x) Fragment(string = my_df[1:nrow(my_df),1:ncol(my_df)]))
Answer 0 (score: 0)
Your function has a bug. Here is a corrected version:
Fragment = function(string) {
nStart = sample(1:(nchar(string) -5), 1)
substr(string, nStart, nStart + 4)
}
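The parentheses around nchar(string) - 5 were missing, which broke the substring extraction. In R the : operator binds more tightly than binary -, so 1:nchar(string) - 5 is evaluated as (1:nchar(string)) - 5, and the sampled start position could be zero or negative.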
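To make the precedence issue concrete, here is a small sketch. The length 10 and the string "ABCDEFGHIJ" are arbitrary values chosen for illustration and are not taken from the question.

```r
# Precedence check: ':' is evaluated before binary '-'
n <- 10
bad  <- 1:n - 5      # parsed as (1:n) - 5, so it contains zero and negative values
good <- 1:(n - 5)    # the intended range of start positions
print(bad)
print(good)

s <- "ABCDEFGHIJ"
# substr() treats a start below 1 as 1, so this window comes back one character short
short_len <- nchar(substr(s, 0, 4))
# when the whole window lies before the string, the result is an empty string
empty_len <- nchar(substr(s, -4, 0))
print(c(short_len, empty_len))
```

This is why the original function sometimes returned fewer than 5 characters: any non-positive start position silently shortens the window.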
You can then use apply(my_df, c(1,2), Fragment), as suggested in the comments.
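One way to see why this call works where the one in the question did not: with MARGIN = c(1, 2), apply hands each cell to the function as x, whereas the anonymous function in the question ignored x and passed the whole data frame to Fragment every time. A minimal sketch, using the strings from the example df and assuming the columns are named BP and TF as shown there:

```r
# Data copied from the example df in the question
my_df <- data.frame(
  BP = c("CGTCTCTATTCTAGGCAAGA",
         "AAGTCACTCGAATTCGGATGCCCCCTAGGC",
         "TGCTCATGACGGGAC"),
  TF = c("TTTFFFFTFFFTFFFTFTTT",
         "TTFFFFFTFFFFTTFTFFTTTFTTTTFTFF",
         "FFFTFTFFFFTFTFT"),
  stringsAsFactors = FALSE
)

# Corrected function from this answer
Fragment <- function(string) {
  nStart <- sample(1:(nchar(string) - 5), 1)
  substr(string, nStart, nStart + 4)
}

# Each cell is passed to the function as x, one at a time
res <- apply(my_df, c(1, 2), function(x) Fragment(x))
print(res)
print(nchar(res))  # every entry should be 5
```

Writing function(x) Fragment(x) is equivalent to passing Fragment directly; it is spelled out here only to show where x comes from.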
To show that it now works:
for(i in 1:10000){
stopifnot(all(5 == sapply(apply(my_df, c(1,2), Fragment), nchar)))
}
This shows that in 10,000 attempts it always produces 5 characters as output.
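Note that apply with c(1,2) draws a separate random start for every cell. In the expected output of the question, both fragments in a row appear to come from the same position, so if that alignment is required, one possible approach is to sample a single start per row. The sketch below rests on that reading of the example; FragmentRow and its width argument are hypothetical names introduced here, not part of the original code. It also uses sample.int(n - width + 1, 1), which allows the final window of the string as well.

```r
set.seed(1)  # only so that the sketch is reproducible

# Data copied from the example df in the question
my_df <- data.frame(
  BP = c("CGTCTCTATTCTAGGCAAGA",
         "AAGTCACTCGAATTCGGATGCCCCCTAGGC",
         "TGCTCATGACGGGAC"),
  TF = c("TTTFFFFTFFFTFFFTFTTT",
         "TTFFFFFTFFFFTTFTFFTTTFTTTTFTFF",
         "FFFTFTFFFFTFTFT"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one random start per row, applied to all strings in that row
FragmentRow <- function(strings, width = 5) {
  n <- min(nchar(strings))
  stopifnot(n >= width)                   # each string must hold at least one window
  start <- sample.int(n - width + 1, 1)   # start positions 1 .. n - width + 1
  substr(strings, start, start + width - 1)
}

# apply() over rows returns one column per row, so transpose back
aligned <- t(apply(my_df, 1, FragmentRow))
print(aligned)
```

Each row of aligned then holds a BP fragment and a TF fragment taken from the same positions.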