Question

My data frame bp_example looks like this:

 structure(list(Sequence = c("Sequence", "Sequence", "Sequence", 
"Sequence", "Sequence", "Sequence", "Sequence", "Sequence", "Sequence", 
"Sequence", "Sequence", "Sequence", "Sequence", "Sequence", "Sequence", 
"Sequence", "Sequence", "Sequence", "Sequence", "Sequence", "Sequence", 
"Sequence", "Sequence", "Sequence", "Sequence"), start = c(1, 
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
20, 21, 22, 23, 24, 25), end = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25), 
    score = c(-0.205, -0.229, -0.115, -0.427, -0.327, -0.543, 
    -0.717, -0.923, -1.241, -1.471, -1.737, -1.717, -1.247, -1.137, 
    -0.689, -0.731, -0.337, 0.091, 0.579, 0.93, 0.575, 0.128, 
    -0.036, -0.186, -0.259), residue = c("M", "D", "A", "R", 
    "M", "R", "E", "L", "S", "F", "K", "V", "V", "L", "L", "G", 
    "E", "G", "R", "V", "G", "K", "T", "S", "L"), epitope = c(".", 
    ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", 
    ".", ".", ".", ".", ".", "E", "E", "E", ".", ".", ".", "."
    )), .Names = c("Sequence", "start", "end", "score", "residue", 
"epitope"), class = c("data.table", "data.frame"), row.names = c(NA, 
-25L))

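I'm not sure whether what I want is even possible, but here goes. I want to iterate over the column bp_example$epitope and, wherever there are more than 14 consecutive "E"s, i.e. 15 or more consecutive rows in which "E" appears in bp_example$epitope, print the corresponding characters from the previous column (bp_example$residue) as a single string (factor).

Given the example above, I would like the string MDARMRELSFKVVLLG to be printed (preferably stored as an element of a list or data.frame).

I tried a few while loops, but with no success at all.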

Answer 1

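Here is an option using data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1), with df1 holding the question's data) and create a run-length-id (rleid) column ('grp') based on whether 'epitope' equals "E". Grouped by 'Sequence' and 'grp', we specify the logical condition in i (epitope == "E") and, if the number of rows (.N) is greater than 14, paste the 'residue' elements together.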

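library(data.table)
# grp: run id that changes whenever (epitope == "E") flips
setDT(df1)[, grp := rleid(epitope == "E")][epitope == "E",
     # NULL for runs of 14 rows or fewer, so those groups are dropped
     if (.N > 14) .(residueConcat = paste(trimws(residue), collapse = "")),
     by = .(Sequence, grp)]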
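
To see what the run-length grouping does, here is a minimal sketch on a small toy table. The table `toy` and the threshold `min_run` are made up for illustration and are not part of the question's data; the threshold is lowered to 3 so that a short run qualifies.

```r
library(data.table)

# Toy data (hypothetical): one run of three "E"s and one run of a single "E".
toy <- data.table(
  Sequence = "Sequence",
  residue  = c("M", "D", "A", "R", "M", "R", "E"),
  epitope  = c(".", "E", "E", "E", ".", "E", ".")
)

min_run <- 3L  # hypothetical threshold; the question needs runs longer than 14

# rleid() gives each run of identical values its own integer id.
toy[, grp := rleid(epitope == "E")]

# Keep only the "E" rows, then collapse the residues of each long-enough run.
# Returning NULL for a group drops that group from the result.
runs <- toy[epitope == "E",
            if (.N >= min_run) .(residueConcat = paste(residue, collapse = "")),
            by = .(Sequence, grp)]
print(runs)
```

Only the three-row run survives here, so `runs` has a single row whose residueConcat is "DAR".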

Answer 2

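An option using base R. I don't think you have to use a loop for this. In the code below, I suggest finding the matching indices and then, within the resulting vector, finding runs of consecutive indices with more than 14 elements: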

# find the indices of the matching rows
matching <- which(bp_example$epitope == 'E')

# split the indices into separate vectors of consecutive elements
index <- split(matching, cumsum(seq_along(matching) %in% (which(diff(matching)>1)+1)))

# get the result by subscripting residue with each vector of indices (NULL for short runs)
result <- lapply(index, function(x) if(length(x)> 14) paste0(bp_example$residue[x], collapse = ''))

To get the final result as a data frame, with each matched sequence as a new row:

as.data.frame(unlist(result))
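
As a rough alternative sketch in base R, the same idea can be written with rle(). The helper name `collapse_runs` and its `min_run` argument are hypothetical, introduced only for this illustration; the two vectors retype the question's residue and epitope columns so the block runs on its own.

```r
# Hypothetical helper (not from the answer): collapse `residue` over each run
# of "E" in `epitope` that is at least `min_run` rows long.
collapse_runs <- function(residue, epitope, min_run = 15L) {
  r      <- rle(epitope == "E")      # lengths of the TRUE/FALSE runs
  ends   <- cumsum(r$lengths)        # last row of each run
  starts <- ends - r$lengths + 1L    # first row of each run
  keep   <- which(r$values & r$lengths >= min_run)
  vapply(keep,
         function(i) paste(residue[starts[i]:ends[i]], collapse = ""),
         character(1))
}

# The question's columns as posted: "E" appears only in rows 19 to 21.
residue <- c("M", "D", "A", "R", "M", "R", "E", "L", "S", "F", "K", "V", "V",
             "L", "L", "G", "E", "G", "R", "V", "G", "K", "T", "S", "L")
epitope <- c(rep(".", 18), rep("E", 3), rep(".", 4))

collapse_runs(residue, epitope)                # no run of 15 or more rows
collapse_runs(residue, epitope, min_run = 3L)  # the three-row run: "RVG"
```

With the data as posted, no run reaches 15 rows, so the default call returns an empty character vector; lowering the threshold shows the collapsing step at work.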

Get characters from one column based on the entries in the adjacent column
