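I am trying to order a variable in R: a list of file names containing three substrings that I would like to order on. The file names have the following format: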
MAF001.incMHC.zPGS.S1
MAF002.incMHC.zPGS.S1
MAF003.incMHC.zPGS.S1
MAF001.incMHC.zPGS.S2
MAF002.incMHC.zPGS.S2
MAF003.incMHC.zPGS.S2
MAF001.noMHC_incRS148.zPGS.S1
MAF002.noMHC_incRS148.zPGS.S1
MAF003.noMHC_incRS148.zPGS.S1
MAF001.noMHC_incRS148.zPGS.S2
MAF002.noMHC_incRS148.zPGS.S2
MAF003.noMHC_incRS148.zPGS.S2
MAF001.noMHC.zPGS.S1
MAF002.noMHC.zPGS.S1
MAF003.noMHC.zPGS.S1
MAF001.noMHC.zPGS.S2
MAF002.noMHC.zPGS.S2
MAF003.noMHC.zPGS.S2
I would like to order this list first on the MAF substring, then on the MHC substring, then on the S substring, so that the order is:
MAF001.incMHC.zPGS.S1
MAF001.noMHC_incRS148.zPGS.S1
MAF001.noMHC.zPGS.S1
MAF001.incMHC.zPGS.S2
MAF001.noMHC_incRS148.zPGS.S2
MAF001.noMHC.zPGS.S2
MAF002.incMHC.zPGS.S1
MAF002.noMHC_incRS148.zPGS.S1
MAF002.noMHC.zPGS.S1
MAF002.incMHC.zPGS.S2
MAF002.noMHC_incRS148.zPGS.S2
MAF002.noMHC.zPGS.S2
MAF003.incMHC.zPGS.S1
MAF003.noMHC_incRS148.zPGS.S1
MAF003.noMHC.zPGS.S1
MAF003.incMHC.zPGS.S2
MAF003.noMHC_incRS148.zPGS.S2
MAF003.noMHC.zPGS.S2
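After seeing the answer to this question about a single substring, I have been playing around with gsub: R Sort strings according to substring
But I am not sure how to extend this idea to multiple substrings (of mixed character and numeric classes) within a string.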
Answer 0: (score: 2)
Here is a one-liner in base R:
bar <- foo[order(sapply(strsplit(foo, "\\."), function(x) paste(x[1], x[4])))]
head(data.frame(result = bar), 10)
result
1 MAF001.incMHC.zPGS.S1
2 MAF001.noMHC_incRS148.zPGS.S1
3 MAF001.noMHC.zPGS.S1
4 MAF001.incMHC.zPGS.S2
5 MAF001.noMHC_incRS148.zPGS.S2
6 MAF001.noMHC.zPGS.S2
7 MAF002.incMHC.zPGS.S1
8 MAF002.noMHC_incRS148.zPGS.S1
9 MAF002.noMHC.zPGS.S1
10 MAF002.incMHC.zPGS.S2
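Explanation:
strsplit(foo, "\\.") splits each string at the literal . character.
paste(x[1], x[4]) joins the first (MAF) and fourth (S) pieces into a single sort key.
order returns the permutation that sorts those keys; ties keep their input order, so the MHC order comes from the order of foo itself.
foo[] subsets foo by that permutation.
Data (foo):
foo <- c("MAF001.incMHC.zPGS.S1", "MAF002.incMHC.zPGS.S1", "MAF003.incMHC.zPGS.S1",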
"MAF001.incMHC.zPGS.S2", "MAF002.incMHC.zPGS.S2", "MAF003.incMHC.zPGS.S2",
"MAF001.noMHC_incRS148.zPGS.S1", "MAF002.noMHC_incRS148.zPGS.S1",
"MAF003.noMHC_incRS148.zPGS.S1", "MAF001.noMHC_incRS148.zPGS.S2",
"MAF002.noMHC_incRS148.zPGS.S2", "MAF003.noMHC_incRS148.zPGS.S2",
"MAF001.noMHC.zPGS.S1", "MAF002.noMHC.zPGS.S1", "MAF003.noMHC.zPGS.S1",
"MAF001.noMHC.zPGS.S2", "MAF002.noMHC.zPGS.S2", "MAF003.noMHC.zPGS.S2"
)
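The one-liner above gets the MHC order only from the order of the input. A minimal sketch that does not depend on input order is shown below: it passes three separate keys to order() and encodes the MHC order as factor levels. The level order (incMHC, noMHC_incRS148, noMHC) is an assumption read off the desired output in the question, and the names parts, maf, mhc, s and sorted are illustrative only.

```r
# A shuffled subset of the question's file names.
foo <- c("MAF002.noMHC.zPGS.S1", "MAF001.noMHC.zPGS.S2",
         "MAF001.incMHC.zPGS.S2", "MAF001.noMHC.zPGS.S1",
         "MAF001.noMHC_incRS148.zPGS.S1", "MAF001.incMHC.zPGS.S1",
         "MAF001.noMHC_incRS148.zPGS.S2", "MAF002.incMHC.zPGS.S1")

# Split on the literal dot; every name has four pieces, so rbind gives a matrix.
parts <- do.call(rbind, strsplit(foo, ".", fixed = TRUE))

maf <- parts[, 1]
# Assumed custom order for the MHC piece, taken from the desired output.
mhc <- factor(parts[, 2], levels = c("incMHC", "noMHC_incRS148", "noMHC"))
# Compare the S piece as a number so that S10 would sort after S2.
s <- as.integer(sub("^S", "", parts[, 4]))

# method = "radix" sorts character keys in the C locale, independent of the session locale.
sorted <- foo[order(maf, s, mhc, method = "radix")]
print(sorted)
```

Changing the argument order in order() (for example maf, mhc, s) changes which substring takes priority.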
Answer 1: (score: 1)
Using tidyr and dplyr:
library(tidyr)
library(dplyr)
df <- data.frame(filenames = c(...))
pattern = "^([^.]+)\\.([^.]+)"
df %>%
  extract(filenames,
          into = c("maf", "mhc"),
          regex = pattern, remove = FALSE) %>%
  arrange(maf, mhc) %>%
  select(filenames)
Which yields
filenames
1 MAF001.incMHC.zPGS.S1
2 MAF001.incMHC.zPGS.S2
3 MAF001.noMHC.zPGS.S1
4 MAF001.noMHC.zPGS.S2
5 MAF001.noMHC_incRS148.zPGS.S1
6 MAF001.noMHC_incRS148.zPGS.S2
7 MAF002.incMHC.zPGS.S1
8 MAF002.incMHC.zPGS.S2
9 MAF002.noMHC.zPGS.S1
10 MAF002.noMHC.zPGS.S2
11 MAF002.noMHC_incRS148.zPGS.S1
12 MAF002.noMHC_incRS148.zPGS.S2
13 MAF003.incMHC.zPGS.S1
14 MAF003.incMHC.zPGS.S2
15 MAF003.noMHC.zPGS.S1
16 MAF003.noMHC.zPGS.S2
17 MAF003.noMHC_incRS148.zPGS.S1
18 MAF003.noMHC_incRS148.zPGS.S2
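The pattern above captures only the MAF and MHC pieces, so the S order comes from the input order. The following sketch shows how a third capture group could make S an explicit key. The regex pattern3, the column name s, and the small data frame are assumptions for illustration, not part of the original answer.

```r
library(tidyr)
library(dplyr)

df <- data.frame(
  filenames = c("MAF002.incMHC.zPGS.S1", "MAF001.noMHC.zPGS.S2",
                "MAF001.incMHC.zPGS.S2", "MAF001.noMHC.zPGS.S1"),
  stringsAsFactors = FALSE
)

# Groups: (MAF piece) . (MHC piece) . <skipped piece> . S(number)
pattern3 <- "^([^.]+)\\.([^.]+)\\.[^.]+\\.S(\\d+)$"

sorted_df <- df %>%
  tidyr::extract(filenames,
                 into = c("maf", "mhc", "s"),
                 regex = pattern3, remove = FALSE,
                 convert = TRUE) %>%   # convert = TRUE turns the S digits into integers
  arrange(maf, mhc, s) %>%
  select(filenames)

print(sorted_df)
```

Reordering the arguments of arrange() changes which substring takes priority.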
Answer 2: (score: 0)
This result matches your desired output, but it sorts only by MAF and S. I do not understand how you want to sort on the MHC string; if this answer does not meet your needs, please elaborate on that part.
library(stringr)
maf <- str_extract(filenames, "MAF\\d+\\.")
mhc <- str_extract(filenames, "\\..*MHC.*\\.")
s <- str_extract(filenames, "S\\d+$")
library(magrittr)
library(dplyr)
data.frame(filenames, maf, mhc, s) %>%
  arrange(maf, s) %>%
  select(filenames)
The output is:
filenames
1 MAF001.incMHC.zPGS.S1
2 MAF001.incMHC.zPGS.S2
3 MAF001.noMHC.zPGS.S1
4 MAF001.noMHC.zPGS.S2
5 MAF001.noMHC_incRS148.zPGS.S1
6 MAF001.noMHC_incRS148.zPGS.S2
7 MAF002.incMHC.zPGS.S1
8 MAF002.incMHC.zPGS.S2
9 MAF002.noMHC.zPGS.S1
10 MAF002.noMHC.zPGS.S2
11 MAF002.noMHC_incRS148.zPGS.S1
12 MAF002.noMHC_incRS148.zPGS.S2
13 MAF003.incMHC.zPGS.S1
14 MAF003.incMHC.zPGS.S2
15 MAF003.noMHC.zPGS.S1
16 MAF003.noMHC.zPGS.S2
17 MAF003.noMHC_incRS148.zPGS.S1
18 MAF003.noMHC_incRS148.zPGS.S2
where filenames is the character vector of file names.
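Answer 3: (score: 0)
Many good solutions have already been added here. I am adding another one that is based only on the use of a vector.
Note: The OP intends to sort on the MAF, MHC and S substrings. I have stuck to this rule and sorted on all three. Hence the result of my answer may not match the other answers.
Approach:
Use regmatches to extract the three substrings named by the OP.
paste them together into a single key on which a sort can be performed.
Use setNames to attach the keys to the vector as names.
Order the vector by those names.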
v[order(names(setNames(v,
  paste(regmatches(v, regexpr("^MAF\\d+", v, perl = TRUE)),
        regmatches(v, regexpr("\\w*MHC\\w*", v, perl = TRUE)),
        regmatches(v, regexpr("\\w+\\d+$", v, perl = TRUE))
  ))))]
#Result
[1] "MAF001.incMHC.zPGS.S1"
[2] "MAF001.incMHC.zPGS.S2"
[3] "MAF001.noMHC.zPGS.S1"
[4] "MAF001.noMHC.zPGS.S2"
[5] "MAF001.noMHC_incRS148.zPGS.S1"
[6] "MAF001.noMHC_incRS148.zPGS.S2"
[7] "MAF002.incMHC.zPGS.S1"
[8] "MAF002.incMHC.zPGS.S2"
[9] "MAF002.noMHC.zPGS.S1"
[10] "MAF002.noMHC.zPGS.S2"
[11] "MAF002.noMHC_incRS148.zPGS.S1"
[12] "MAF002.noMHC_incRS148.zPGS.S2"
[13] "MAF003.incMHC.zPGS.S1"
[14] "MAF003.incMHC.zPGS.S2"
[15] "MAF003.noMHC.zPGS.S1"
[16] "MAF003.noMHC.zPGS.S2"
[17] "MAF003.noMHC_incRS148.zPGS.S1"
[18] "MAF003.noMHC_incRS148.zPGS.S2"
Data
v <- c("MAF001.incMHC.zPGS.S1", "MAF001.noMHC_incRS148.zPGS.S1", "MAF001.noMHC.zPGS.S1",
"MAF001.incMHC.zPGS.S2", "MAF001.noMHC_incRS148.zPGS.S2", "MAF001.noMHC.zPGS.S2",
"MAF002.incMHC.zPGS.S1", "MAF002.noMHC_incRS148.zPGS.S1", "MAF002.noMHC.zPGS.S1",
"MAF002.incMHC.zPGS.S2", "MAF002.noMHC_incRS148.zPGS.S2", "MAF002.noMHC.zPGS.S2",
"MAF003.incMHC.zPGS.S1", "MAF003.noMHC_incRS148.zPGS.S1", "MAF003.noMHC.zPGS.S1",
"MAF003.incMHC.zPGS.S2", "MAF003.noMHC_incRS148.zPGS.S2", "MAF003.noMHC.zPGS.S2"
)
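As a variation on the approach above, the extracted pieces can be passed to order() as separate keys instead of being pasted into one string. With a pasted key, how "noMHC S1" compares with "noMHC_incRS148 S1" hinges on how the separator compares with the underscore, which may depend on the locale. The sketch below is an illustration under that assumption; the names maf, mhc, s and sorted_v are hypothetical, and the S pattern is narrowed to "S\\d+$".

```r
v <- c("MAF002.noMHC.zPGS.S2", "MAF001.noMHC_incRS148.zPGS.S1",
       "MAF001.noMHC.zPGS.S1", "MAF001.incMHC.zPGS.S1")

# Extract each piece with its own regular expression.
maf <- regmatches(v, regexpr("^MAF\\d+", v, perl = TRUE))
mhc <- regmatches(v, regexpr("\\w*MHC\\w*", v, perl = TRUE))
s   <- regmatches(v, regexpr("S\\d+$", v, perl = TRUE))

# regmatches() drops elements that do not match, so check that the keys
# still line up with v before using them to index it.
stopifnot(length(maf) == length(v),
          length(mhc) == length(v),
          length(s) == length(v))

# Separate keys are compared one after another; "radix" uses C-locale ordering.
sorted_v <- v[order(maf, mhc, s, method = "radix")]
print(sorted_v)
```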
Answer 4: (score: 0)
I have a function designed specifically for tasks like this:
Function
# Packages the function relies on: %<>% and %>% (magrittr), map* (purrr),
# str_extract (stringr), as.data.table and .SD (data.table).
library(magrittr)
library(purrr)
library(stringr)
library(data.table)
reg_sort <- function(x,...,verbose=F) {
  ellipsis <- sapply(as.list(substitute(list(...)))[-1], deparse, simplify="array")
  reg_list <- paste0(ellipsis, collapse=',')
  reg_list %<>% strsplit(",") %>% unlist %>% gsub("\\\\","\\",.,fixed=T)
  pattern <- reg_list %>% map_chr(~sub("^-\\\"","",.) %>% sub("\\\"$","",.) %>% sub("^\\\"","",.) %>% trimws)
  descInd <- reg_list %>% map_lgl(~grepl("^-\\\"",.)%>%as.logical)
  reg_extr <- pattern %>% map(~str_extract(x,.)) %>% c(.,list(x)) %>% as.data.table
  reg_extr[] %<>% lapply(., function(x) type.convert(as.character(x), as.is = TRUE))
  map(rev(seq_along(pattern)),~{reg_extr<<-reg_extr[order(reg_extr[[.]],decreasing = descInd[.])]})
  if(verbose) { tmp<-lapply(reg_extr[,.SD,.SDcols=seq_along(pattern)],unique);names(tmp)<-pattern;tmp %>% print }
  return(reg_extr[[ncol(reg_extr)]])
}
Data:
vec <- c("MAF001.incMHC.zPGS.S1", "MAF002.incMHC.zPGS.S1", "MAF003.incMHC.zPGS.S1",
"MAF001.incMHC.zPGS.S2", "MAF002.incMHC.zPGS.S2", "MAF003.incMHC.zPGS.S2",
"MAF001.noMHC_incRS148.zPGS.S1", "MAF002.noMHC_incRS148.zPGS.S1",
"MAF003.noMHC_incRS148.zPGS.S1", "MAF001.noMHC_incRS148.zPGS.S2",
"MAF002.noMHC_incRS148.zPGS.S2", "MAF003.noMHC_incRS148.zPGS.S2",
"MAF001.noMHC.zPGS.S1", "MAF002.noMHC.zPGS.S1", "MAF003.noMHC.zPGS.S1",
"MAF001.noMHC.zPGS.S2", "MAF002.noMHC.zPGS.S2", "MAF003.noMHC.zPGS.S2"
)
Call:
reg_sort(x=vec, "^.*?(?=\\.)","(?<=\\.).*(?<=\\.S)","S\\d+$")
Result: (character vector)
1 MAF001.incMHC.zPGS.S1
2 MAF001.incMHC.zPGS.S2
3 MAF001.noMHC.zPGS.S1
4 MAF001.noMHC.zPGS.S2
5 MAF001.noMHC_incRS148.zPGS.S1
6 MAF001.noMHC_incRS148.zPGS.S2
7 MAF002.incMHC.zPGS.S1
8 MAF002.incMHC.zPGS.S2
9 MAF002.noMHC.zPGS.S1
10 MAF002.noMHC.zPGS.S2
11 MAF002.noMHC_incRS148.zPGS.S1
12 MAF002.noMHC_incRS148.zPGS.S2
13 MAF003.incMHC.zPGS.S1
14 MAF003.incMHC.zPGS.S2
15 MAF003.noMHC.zPGS.S1
16 MAF003.noMHC.zPGS.S2
17 MAF003.noMHC_incRS148.zPGS.S1
18 MAF003.noMHC_incRS148.zPGS.S2
Other features include:
Descending sort (put - in front of a pattern): reg_sort(x=vec, -"^.*?(?=\\.)","(?<=\\.).*(?<=\\.S)",-"S\\d+$")
Verbose mode (to see and check what each regex pattern extracts for sorting): reg_sort(x=vec, "^.*?(?=\\.)","(?<=\\.).*(?<=\\.S)","S\\d+$",verbose=T)
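For comparison, a per-key descending sort can also be sketched in base R without reg_sort. It relies on order() accepting a vector for decreasing when method = "radix". The variable names and the four-element input are illustrative assumptions.

```r
vec <- c("MAF001.incMHC.zPGS.S1", "MAF002.incMHC.zPGS.S2",
         "MAF001.incMHC.zPGS.S2", "MAF002.incMHC.zPGS.S1")

# Everything before the first dot is the MAF piece.
maf <- sub("\\..*$", "", vec)
# The digits after the final ".S" are the S number.
s <- as.integer(sub("^.*\\.S", "", vec))

# One decreasing flag per key: MAF descending, S ascending.
res <- vec[order(maf, s, decreasing = c(TRUE, FALSE), method = "radix")]
print(res)
```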