R - unable to store multiple outputs from a for loop in a vector or data frame

Time: 2017-01-26 09:00:11

Tags: r string for-loop vector concatenation

I have a data frame called lbt_all_epitopes with 38282 rows and 3 columns, like this:

 sequence    score epitope.
1 RPGGPPGYRTPYTAK 1.724911  Epitope
2 TQGDRQKIQDAVSAA 1.664611  Epitope
3 EVKSRYNVDVSQNKR 1.593236  Epitope
4 VIEMTRAFEDDDFDK 1.578200  Epitope
5 ITQGDRQKIQDAVSA 1.533208  Epitope
6 GSADLTPSNLTRPAS 1.532700  Epitope

In the first column (named sequence) I have many similar strings that I want to remove (I will use str_sub to look for similar strings). For example, taking the first string ("RPGGPPGYRTPYTAK"), I want to look for similar strings in the whole column lbt_all_epitopes$sequence and store them in a vector or data.frame called to_be_removed. I want to repeat this for the first 30 elements of lbt_all_epitopes$sequence. For simplicity, let's consider only the first five rows. When I run the loop like this:

# Iterate over the first 5 rows
top_30 <- 1:5

for(i in top_30) {
  print(agrep(str_sub(lbt_all_epitopes$sequence[i], start = 5, end = 11), lbt_all_epitopes$sequence, value = T))
}

the output is exactly what I want, i.e. it prints all strings (11 per iteration) that are similar to the first, second, third ... fifth element of lbt_all_epitopes$sequence:

 [1] "RPGGPPGYRTPYTAK" "VGTRPGGPPGYRTPY" "TRPGGPPGYRTPYTA" "GGPPGYRTPYTAKPF" "PGGPPGYRTPYTAKP"
 [6] "LVGTRPGGPPGYRTP" "TLVGTRPGGPPGYRT" "GPPGYRTPYTAKPFV" "PPGYRTPYTAKPFVM" "GTRPGGPPGYRTPYT"
[11] "PGYRTPYTAKPFVMC"
 [1] "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA" "GITQGDRQKIQDAVS" "NGITQGDRQKIQDAV" "QGDRQKIQDAVSAAS"
 [6] "QNGITQGDRQKIQDA" "GDRQKIQDAVSAASS" "VQNGITQGDRQKIQD" "DRQKIQDAVSAASSW" "RQKIQDAVSAASSWL"
[11] "QKIQDAVSAASSWLE"
 [1] "EVKSRYNVDVSQNKR" "VKSRYNVDVSQNKRA" "NEVKSRYNVDVSQNK" "KSRYNVDVSQNKRAR" "LNEVKSRYNVDVSQN"
 [6] "YNVDVSQNKRARLRL" "RYNVDVSQNKRARLR" "MLNEVKSRYNVDVSQ" "SRYNVDVSQNKRARL" "HMLNEVKSRYNVDVS"
[11] "EHMLNEVKSRYNVDV"
 [1] "VIEMTRAFEDDDFDK" "RVIEMTRAFEDDDFD" "GDRVIEMTRAFEDDD" "DRVIEMTRAFEDDDF" "IEMTRAFEDDDFDKF"
 [6] "RGDRVIEMTRAFEDD" "EMTRAFEDDDFDKFD" "FRGDRVIEMTRAFED" "MTRAFEDDDFDKFDR" "TRAFEDDDFDKFDRV"
[11] "RAFEDDDFDKFDRVR"
 [1] "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA" "GITQGDRQKIQDAVS" "NGITQGDRQKIQDAV" "QGDRQKIQDAVSAAS"
 [6] "QNGITQGDRQKIQDA" "GDRQKIQDAVSAASS" "VQNGITQGDRQKIQD" "DVQNGITQGDRQKIQ" "DRQKIQDAVSAASSW"
[11] "RQKIQDAVSAASSWL"

However, when I try to store the output in a vector (called to_be_removed) using the following loop:

# create the empty vector where I will store the output
to_be_removed <- c()

for(i in top_30) {
  to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i], start = 5, end = 11), lbt_all_epitopes$sequence, value = T)
}

I noticed that only one string is stored per iteration (instead of 11 strings per iteration), like this:

> to_be_removed
[1] "RPGGPPGYRTPYTAK" "TQGDRQKIQDAVSAA" "EVKSRYNVDVSQNKR" "VIEMTRAFEDDDFDK" "TQGDRQKIQDAVSAA"

and the following warning messages are displayed:

Warning messages:
1: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i],  :
  number of items to replace is not a multiple of replacement length
2: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i],  :
  number of items to replace is not a multiple of replacement length
3: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i],  :
  number of items to replace is not a multiple of replacement length
4: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i],  :
  number of items to replace is not a multiple of replacement length
5: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i],  :
  number of items to replace is not a multiple of replacement length

So I assume I am missing the code that tells R to concatenate all the strings produced in each iteration before moving on to the next one. Does anyone know how to store the output correctly in a vector or even a data.frame?

2 Answers:

Answer 0 (score: 2)

You can create a list instead:

# create the empty list where the output will be stored
to_be_removed <- list()

for(i in top_30) {
  # [[i]] stores the whole character vector returned by agrep() in slot i
  to_be_removed[[i]] <- agrep(str_sub(lbt_all_epitopes$sequence[i], start = 5, end = 11), lbt_all_epitopes$sequence, value = TRUE)
}

Note the double brackets used to fill the list.
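The brackets matter because `x[i] <- value` on an atomic vector expects one replacement item per index, which is where the warning in the question comes from: only the first match is kept. Here is a minimal sketch of the difference using made-up data; the names `matches`, `as_vector` and `as_list` are illustrative and do not come from the question.

```r
# Toy stand-ins for what agrep() returns in each iteration (hypothetical data)
matches <- list(c("AAA", "AAB", "ABA"), c("CCC", "CCD"))

# Single brackets on an atomic vector: only the first match survives.
# R would warn "number of items to replace is not a multiple of
# replacement length"; the warning is silenced here to keep the output clean.
as_vector <- character(0)
for (i in seq_along(matches)) {
  suppressWarnings(as_vector[i] <- matches[[i]])
}

# Double brackets on a list: each slot holds the whole vector.
as_list <- vector("list", length(matches))  # one preallocated slot per iteration
for (i in seq_along(matches)) {
  as_list[[i]] <- matches[[i]]
}

print(as_vector)         # one string per iteration
print(lengths(as_list))  # number of matches kept per iteration
```

Preallocating with `vector("list", n)` is optional for five iterations, but it avoids growing the list when the loop runs over many rows.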

Next time, please post your data using dput so that we can use it directly. Like this: dput(lbt_all_epitopes) returns:

structure(list(X = 1:6, sequence = structure(c(4L, 5L, 1L, 6L, 
3L, 2L), .Label = c("EVKSRYNVDVSQNKR", "GSADLTPSNLTRPAS", "ITQGDRQKIQDAVSA", 
"RPGGPPGYRTPYTAK", "TQGDRQKIQDAVSAA", "VIEMTRAFEDDDFDK"), class = "factor"), 
    score = structure(c(6L, 5L, 4L, 3L, 2L, 1L), .Label = c("1.532700", 
    "1.533208", "1.578200", "1.593236", "1.664611", "1.724911"
    ), class = "factor"), epitope. = structure(c(1L, 1L, 1L, 
    1L, 1L, 1L), .Label = "Epitope", class = "factor")), .Names = c("X", 
"sequence", "score", "epitope."), class = "data.frame", row.names = c(NA, 
-6L))

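Answer 1 (score: 1)

To avoid growing an object inside a for() loop, we can use lapply(). This should be faster when working with large datasets.

to_be_removed <- lapply(lbt_all_epitopes$sequence[1:5], function(x) agrep(str_sub(x, start = 5, end = 11), lbt_all_epitopes$sequence, value = TRUE))

This gives a list in which the strings extracted for each row sit in a separate list entry: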

[[1]]
[1] "RPGGPPGYRTPYTAK"

[[2]]
[1] "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA"

[[3]]
[1] "EVKSRYNVDVSQNKR"

[[4]]
[1] "VIEMTRAFEDDDFDK"

[[5]]
[1] "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA"
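For a version that runs without the stringr package, base `substr()` can stand in for `str_sub()` here (a substitution for illustration, not what the answer used). The sketch below assumes the six sample sequences from the dput() output, stored as a plain character vector; the names `seqs` and `matches_by_seq` are hypothetical.

```r
# The six sample sequences from the dput() output, as a character vector
seqs <- c("RPGGPPGYRTPYTAK", "TQGDRQKIQDAVSAA", "EVKSRYNVDVSQNKR",
          "VIEMTRAFEDDDFDK", "ITQGDRQKIQDAVSA", "GSADLTPSNLTRPAS")

matches_by_seq <- lapply(seqs[1:5], function(x) {
  core <- substr(x, start = 5, stop = 11)  # same window as str_sub(x, 5, 11)
  agrep(core, seqs, value = TRUE)          # approximate matches among all sequences
})

# Label each entry with the sequence it was computed from
names(matches_by_seq) <- seqs[1:5]
str(matches_by_seq)
```

Naming the entries makes it easier to see which query produced which matches once the loop covers 30 rows.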

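Now you can flatten the list into a vector (which can be used for subsetting) with strsplit() and unlist(). Since the strings contain no spaces, unlist(to_be_removed) alone would give the same result: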

to_be_removed <- unlist(lapply(to_be_removed, function(x) strsplit(x, " ")))

Output:

[1] "RPGGPPGYRTPYTAK" "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA" "EVKSRYNVDVSQNKR" "VIEMTRAFEDDDFDK" "TQGDRQKIQDAVSAA"
[7] "ITQGDRQKIQDAVSA"
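To actually drop the collected sequences, the flattened vector can be deduplicated and used with `%in%`. This is a sketch that assumes the six-row sample from the dput() output and the vector printed above; the name `kept` is illustrative, and whether the query sequences themselves should also be dropped depends on what you want to keep.

```r
# Six-row sample, with sequence stored as character rather than factor
lbt_all_epitopes <- data.frame(
  sequence = c("RPGGPPGYRTPYTAK", "TQGDRQKIQDAVSAA", "EVKSRYNVDVSQNKR",
               "VIEMTRAFEDDDFDK", "ITQGDRQKIQDAVSA", "GSADLTPSNLTRPAS"),
  score = c(1.724911, 1.664611, 1.593236, 1.578200, 1.533208, 1.532700),
  stringsAsFactors = FALSE
)

# Flattened matches as printed above (duplicates included)
to_be_removed <- c("RPGGPPGYRTPYTAK", "TQGDRQKIQDAVSAA", "ITQGDRQKIQDAVSA",
                   "EVKSRYNVDVSQNKR", "VIEMTRAFEDDDFDK", "TQGDRQKIQDAVSAA",
                   "ITQGDRQKIQDAVSA")

# Each sequence only needs to be listed once
to_be_removed <- unique(to_be_removed)

# Keep the rows whose sequence is not in the removal set
kept <- lbt_all_epitopes[!lbt_all_epitopes$sequence %in% to_be_removed, ]
print(kept$sequence)
```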