I have a data frame named lbt_all_epitopes with 38282 rows and 3 columns that looks like this:
         sequence    score epitope.
1 RPGGPPGYRTPYTAK 1.724911  Epitope
2 TQGDRQKIQDAVSAA 1.664611  Epitope
3 EVKSRYNVDVSQNKR 1.593236  Epitope
4 VIEMTRAFEDDDFDK 1.578200  Epitope
5 ITQGDRQKIQDAVSA 1.533208  Epitope
6 GSADLTPSNLTRPAS 1.532700  Epitope
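In the first column (named sequence) there are many similar strings, which I would like to remove (I use str_sub to extract the pattern used to look for similar strings). For example, taking the first string of lbt_all_epitopes$sequence ("RPGGPPGYRTPYTAK"), I want to find the similar strings in the whole column and store them in a vector or data.frame called to_be_removed. I would like to repeat this for the first 30 elements of lbt_all_epitopes$sequence. For simplicity, let's consider only the first five rows. When I run a loop like this: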
library(stringr)  # provides str_sub()

# Iterate over the first 5 rows
top_30 <- 1:5
for (i in top_30) {
  print(agrep(str_sub(lbt_all_epitopes$sequence[i], start = 5, end = 11),
              lbt_all_epitopes$sequence, value = TRUE))
}
The output is exactly what I want, i.e. it prints all the strings (11 per iteration) that are similar to the 1st, 2nd, 3rd, ..., 5th elements of lbt_all_epitopes$sequence:
 [1] "RPGGPPGYRTPYTAK" "VGTRPGGPPGYRTPY" "TRPGGPPGYRTPYTA" "GGPPGYRTPYTAKPF" "PGGPPGYRTPYTAKP"
 [6] "LVGTRPGGPPGYRTP" "TLVGTRPGGPPGYRT" "GPPGYRTPYTAKPFV" "PPGYRTPYTAKPFVM" "GTRPGGPPGYRTPYT"
[11] "PGYRTPYTAKPFVMC"
 [1] "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA" "GITQGDRQKIQDAVS" "NGITQGDRQKIQDAV" "QGDRQKIQDAVSAAS"
 [6] "QNGITQGDRQKIQDA" "GDRQKIQDAVSAASS" "VQNGITQGDRQKIQD" "DRQKIQDAVSAASSW" "RQKIQDAVSAASSWL"
[11] "QKIQDAVSAASSWLE"
 [1] "EVKSRYNVDVSQNKR" "VKSRYNVDVSQNKRA" "NEVKSRYNVDVSQNK" "KSRYNVDVSQNKRAR" "LNEVKSRYNVDVSQN"
 [6] "YNVDVSQNKRARLRL" "RYNVDVSQNKRARLR" "MLNEVKSRYNVDVSQ" "SRYNVDVSQNKRARL" "HMLNEVKSRYNVDVS"
[11] "EHMLNEVKSRYNVDV"
 [1] "VIEMTRAFEDDDFDK" "RVIEMTRAFEDDDFD" "GDRVIEMTRAFEDDD" "DRVIEMTRAFEDDDF" "IEMTRAFEDDDFDKF"
 [6] "RGDRVIEMTRAFEDD" "EMTRAFEDDDFDKFD" "FRGDRVIEMTRAFED" "MTRAFEDDDFDKFDR" "TRAFEDDDFDKFDRV"
[11] "RAFEDDDFDKFDRVR"
 [1] "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA" "GITQGDRQKIQDAVS" "NGITQGDRQKIQDAV" "QGDRQKIQDAVSAAS"
 [6] "QNGITQGDRQKIQDA" "GDRQKIQDAVSAASS" "VQNGITQGDRQKIQD" "DVQNGITQGDRQKIQ" "DRQKIQDAVSAASSW"
[11] "RQKIQDAVSAASSWL"
However, when I try to store the output in a vector (called to_be_removed) with the following loop, each iteration produces only one string as output (instead of 11 strings per iteration):
# create the empty vector where I will store the output
to_be_removed <- c()
for (i in top_30) {
  to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i], start = 5, end = 11),
                            lbt_all_epitopes$sequence, value = TRUE)
}
The resulting vector looks like this:
> to_be_removed
[1] "RPGGPPGYRTPYTAK" "TQGDRQKIQDAVSAA" "EVKSRYNVDVSQNKR" "VIEMTRAFEDDDFDK" "TQGDRQKIQDAVSAA"
and the following warning messages are shown:
Warning messages:
1: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i], :
  number of items to replace is not a multiple of replacement length
2: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i], :
  number of items to replace is not a multiple of replacement length
3: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i], :
  number of items to replace is not a multiple of replacement length
4: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i], :
  number of items to replace is not a multiple of replacement length
5: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i], :
  number of items to replace is not a multiple of replacement length
I therefore assume I am missing code that tells R to concatenate all the strings produced in each iteration before moving on to the next one.
Does anyone know how to store the output correctly in a vector or even a data.frame?
Answer 0 (score: 2)
You can create a list:
# create the empty list where the output will be stored
to_be_removed <- list()
for (i in top_30) {
  to_be_removed[[i]] <- agrep(str_sub(lbt_all_epitopes$sequence[i], start = 5, end = 11),
                              lbt_all_epitopes$sequence, value = TRUE)
}
Note the double brackets used to fill the list.
Next time, please post your data with dput so that we can use it directly. Running this:
dput(lbt_all_epitopes)
returns:
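To illustrate why the double brackets matter, here is a minimal sketch with made-up values (`matches`, `as_vector` and `as_list` are hypothetical names, not taken from the question). Assigning a multi-element result into a single position of an atomic vector keeps only the first element and raises the warning shown in the question, whereas `[[1]]` on a list stores the whole result as one entry.

```r
# Hypothetical multi-element result, standing in for one agrep() call
matches <- c("AAA", "AAB", "AAC")

# Atomic vector with single brackets: only the first element fits into slot 1.
# R warns "number of items to replace is not a multiple of replacement length";
# the warning is suppressed here only to keep the example quiet.
as_vector <- character(0)
suppressWarnings(as_vector[1] <- matches)

# List with double brackets: the complete result becomes a single list entry
as_list <- list()
as_list[[1]] <- matches

print(as_vector)         # one string survives
print(lengths(as_list))  # the list entry holds all three strings
```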
structure(list(X = 1:6, sequence = structure(c(4L, 5L, 1L, 6L,
3L, 2L), .Label = c("EVKSRYNVDVSQNKR", "GSADLTPSNLTRPAS", "ITQGDRQKIQDAVSA",
"RPGGPPGYRTPYTAK", "TQGDRQKIQDAVSAA", "VIEMTRAFEDDDFDK"), class = "factor"),
score = structure(c(6L, 5L, 4L, 3L, 2L, 1L), .Label = c("1.532700",
"1.533208", "1.578200", "1.593236", "1.664611", "1.724911"
), class = "factor"), epitope. = structure(c(1L, 1L, 1L,
1L, 1L, 1L), .Label = "Epitope", class = "factor")), .Names = c("X",
"sequence", "score", "epitope."), class = "data.frame", row.names = c(NA,
-6L))
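Answer 1 (score: 1)
To avoid growing an object inside a for() loop, we can use lapply(). This should be faster when working with large data sets.
to_be_removed <- lapply(lbt_all_epitopes$sequence[1:5], function(x) {
  agrep(str_sub(x, start = 5, end = 11), lbt_all_epitopes$sequence, value = TRUE)
})
This gives a list in which the strings extracted for each row are stored in separate list entries: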
[[1]]
[1] "RPGGPPGYRTPYTAK"
[[2]]
[1] "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA"
[[3]]
[1] "EVKSRYNVDVSQNKR"
[[4]]
[1] "VIEMTRAFEDDDFDK"
[[5]]
[1] "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA"
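For readers who want to try this without the full data set, the following is a small self-contained sketch. It assumes that base `substr()` is an acceptable stand-in for `str_sub()` here, and it uses only the six sequences shown in the question; `sequences` and `matches` are names introduced for this illustration.

```r
# Six sequences from the question, kept as plain character strings
sequences <- c("RPGGPPGYRTPYTAK", "TQGDRQKIQDAVSAA", "EVKSRYNVDVSQNKR",
               "VIEMTRAFEDDDFDK", "ITQGDRQKIQDAVSA", "GSADLTPSNLTRPAS")

# For each of the first five sequences, use characters 5 to 11 as the pattern
# and collect every approximately matching sequence (agrep default tolerance)
matches <- lapply(sequences[1:5], function(x) {
  agrep(substr(x, 5, 11), sequences, value = TRUE)
})

# Name the entries after the query sequence to make the list easier to read
names(matches) <- sequences[1:5]
print(matches)
```

Naming the entries is optional, but it makes it easier to see which query produced which matches once the list grows to 30 entries.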
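Now you can flatten the list into a vector (which can be used for subsetting) with strsplit() and unlist():
# strsplit() leaves these strings unchanged because they contain no spaces;
# unlist() does the actual flattening
to_be_removed <- unlist(lapply(to_be_removed, function(x) strsplit(x, " ")))
Output: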
[1] "RPGGPPGYRTPYTAK" "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA" "EVKSRYNVDVSQNKR" "VIEMTRAFEDDDFDK" "TQGDRQKIQDAVSAA"
[7] "ITQGDRQKIQDAVSA"
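As a possible last step, the sketch below shows how such a vector could be used for subsetting. It assumes the goal is to drop every row whose sequence appears in to_be_removed. The data frame `epitopes` is a small stand-in built from the six rows in the question, and `flat` and `remaining` are names introduced for this illustration.

```r
# Small stand-in for lbt_all_epitopes, using the six rows shown in the question
epitopes <- data.frame(
  sequence = c("RPGGPPGYRTPYTAK", "TQGDRQKIQDAVSAA", "EVKSRYNVDVSQNKR",
               "VIEMTRAFEDDDFDK", "ITQGDRQKIQDAVSA", "GSADLTPSNLTRPAS"),
  score = c(1.724911, 1.664611, 1.593236, 1.578200, 1.533208, 1.532700),
  stringsAsFactors = FALSE
)

# Matches per row, copied from the list output above
to_be_removed <- list(
  "RPGGPPGYRTPYTAK",
  c("TQGDRQKIQDAVSAA", "ITQGDRQKIQDAVSA"),
  "EVKSRYNVDVSQNKR",
  "VIEMTRAFEDDDFDK",
  c("TQGDRQKIQDAVSAA", "ITQGDRQKIQDAVSA")
)

# Flatten the list and drop duplicates so each sequence is listed once
flat <- unique(unlist(to_be_removed))

# Keep only the rows whose sequence is not in the removal vector
remaining <- epitopes[!epitopes$sequence %in% flat, ]
print(remaining)
```

Note that each query sequence also matches itself, so this sketch removes the query rows as well; whether that is desired depends on the goal.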