Sort strings by multiple regex substrings

Asked: 2018-03-15 16:04:27

Tags: r regex sorting substring

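I am trying to sort a variable in R. It is a list of file names, each containing three substrings that I want to sort by. The file names have the following format: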

MAF001.incMHC.zPGS.S1
MAF002.incMHC.zPGS.S1
MAF003.incMHC.zPGS.S1
MAF001.incMHC.zPGS.S2
MAF002.incMHC.zPGS.S2
MAF003.incMHC.zPGS.S2
MAF001.noMHC_incRS148.zPGS.S1
MAF002.noMHC_incRS148.zPGS.S1
MAF003.noMHC_incRS148.zPGS.S1
MAF001.noMHC_incRS148.zPGS.S2
MAF002.noMHC_incRS148.zPGS.S2
MAF003.noMHC_incRS148.zPGS.S2
MAF001.noMHC.zPGS.S1
MAF002.noMHC.zPGS.S1
MAF003.noMHC.zPGS.S1
MAF001.noMHC.zPGS.S2
MAF002.noMHC.zPGS.S2
MAF003.noMHC.zPGS.S2

I want to sort this list first on the MAF substring, then on the MHC substring, then on the S substring, so that the order is:

MAF001.incMHC.zPGS.S1
MAF001.noMHC_incRS148.zPGS.S1
MAF001.noMHC.zPGS.S1
MAF001.incMHC.zPGS.S2
MAF001.noMHC_incRS148.zPGS.S2
MAF001.noMHC.zPGS.S2
MAF002.incMHC.zPGS.S1
MAF002.noMHC_incRS148.zPGS.S1
MAF002.noMHC.zPGS.S1
MAF002.incMHC.zPGS.S2
MAF002.noMHC_incRS148.zPGS.S2
MAF002.noMHC.zPGS.S2
MAF003.incMHC.zPGS.S1
MAF003.noMHC_incRS148.zPGS.S1
MAF003.noMHC.zPGS.S1
MAF003.incMHC.zPGS.S2
MAF003.noMHC_incRS148.zPGS.S2
MAF003.noMHC.zPGS.S2

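After seeing the answers to this question about a single substring, I have been experimenting with gsub: R Sort strings according to substring

But I can't work out how to extend that idea to multiple substrings (of mixed character and numeric classes) within one string.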

5 answers:

Answer 0: (score: 2)

Here is a one-liner in base R:

bar <- foo[order(sapply(strsplit(foo, "\\."), function(x) paste(x[1], x[4])))]
head(data.frame(result = bar), 10)

                          result
1          MAF001.incMHC.zPGS.S1
2  MAF001.noMHC_incRS148.zPGS.S1
3           MAF001.noMHC.zPGS.S1
4          MAF001.incMHC.zPGS.S2
5  MAF001.noMHC_incRS148.zPGS.S2
6           MAF001.noMHC.zPGS.S2
7          MAF002.incMHC.zPGS.S1
8  MAF002.noMHC_incRS148.zPGS.S1
9           MAF002.noMHC.zPGS.S1
10         MAF002.incMHC.zPGS.S2

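Explanation:

  • Split the strings with strsplit: strsplit(foo, "\\.")
  • Extract and combine elements 1 and 4: paste(x[1], x[4])
  • Use order to get the ordering of all the combined keys
  • Use foo[] to get the corresponding values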
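The steps above can be run one at a time to inspect the intermediate values. The following is only a minimal sketch, assuming a shortened four-element stand-in for foo; the names foo_small, parts, keys, idx and sorted_small are introduced here for illustration and are not part of the answer.

```r
# Shortened stand-in for `foo` (hypothetical subset, for illustration only)
foo_small <- c("MAF002.incMHC.zPGS.S1", "MAF001.noMHC.zPGS.S2",
               "MAF001.incMHC.zPGS.S1", "MAF001.incMHC.zPGS.S2")

# Step 1: split each string on literal dots
parts <- strsplit(foo_small, "\\.")

# Step 2: build one sort key per string from element 1 (MAF) and element 4 (S)
keys <- sapply(parts, function(x) paste(x[1], x[4]))

# Step 3: order() returns the permutation that sorts the keys
idx <- order(keys)

# Step 4: apply that permutation to the original vector
sorted_small <- foo_small[idx]
print(sorted_small)
```

Because the MHC part is not in the key, strings that share the same MAF and S values are ties, and order() leaves ties in their input order. That appears to be why the MHC substrings in the result follow the order of the input rather than an alphabetical one.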

Data (foo):

foo <- c("MAF001.incMHC.zPGS.S1", "MAF002.incMHC.zPGS.S1", "MAF003.incMHC.zPGS.S1", 
"MAF001.incMHC.zPGS.S2", "MAF002.incMHC.zPGS.S2", "MAF003.incMHC.zPGS.S2", 
"MAF001.noMHC_incRS148.zPGS.S1", "MAF002.noMHC_incRS148.zPGS.S1", 
"MAF003.noMHC_incRS148.zPGS.S1", "MAF001.noMHC_incRS148.zPGS.S2", 
"MAF002.noMHC_incRS148.zPGS.S2", "MAF003.noMHC_incRS148.zPGS.S2", 
"MAF001.noMHC.zPGS.S1", "MAF002.noMHC.zPGS.S1", "MAF003.noMHC.zPGS.S1", 
"MAF001.noMHC.zPGS.S2", "MAF002.noMHC.zPGS.S2", "MAF003.noMHC.zPGS.S2"
)

Answer 1: (score: 1)

Using tidyr and dplyr:

library(tidyr)
library(dplyr)

df <- data.frame(filenames = c(...))

pattern = "^([^.]+)\\.([^.]+)"
df %>%
  extract(filenames, 
          into = c("maf", "mhc"), 
          regex = pattern, remove = FALSE) %>%
  arrange(maf, mhc) %>%
  select(filenames)

Which yields

                       filenames
1          MAF001.incMHC.zPGS.S1
2          MAF001.incMHC.zPGS.S2
3           MAF001.noMHC.zPGS.S1
4           MAF001.noMHC.zPGS.S2
5  MAF001.noMHC_incRS148.zPGS.S1
6  MAF001.noMHC_incRS148.zPGS.S2
7          MAF002.incMHC.zPGS.S1
8          MAF002.incMHC.zPGS.S2
9           MAF002.noMHC.zPGS.S1
10          MAF002.noMHC.zPGS.S2
11 MAF002.noMHC_incRS148.zPGS.S1
12 MAF002.noMHC_incRS148.zPGS.S2
13         MAF003.incMHC.zPGS.S1
14         MAF003.incMHC.zPGS.S2
15          MAF003.noMHC.zPGS.S1
16          MAF003.noMHC.zPGS.S2
17 MAF003.noMHC_incRS148.zPGS.S1
18 MAF003.noMHC_incRS148.zPGS.S2
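To see what the two capture groups in pattern pick out, the same regex can be applied with base R's regexec and regmatches. This is a sketch only, not the answer's tidyr pipeline; filenames_small and the helper variables are hypothetical names used for illustration.

```r
# Hypothetical three-element input, for illustration only
filenames_small <- c("MAF002.noMHC.zPGS.S1", "MAF001.noMHC.zPGS.S2",
                     "MAF001.incMHC.zPGS.S1")

# Same pattern as in the answer: two dot-free fields at the start of the string
pattern <- "^([^.]+)\\.([^.]+)"

# regexec() + regmatches() give, per string: full match, group 1, group 2
m <- regmatches(filenames_small, regexec(pattern, filenames_small))

maf <- vapply(m, function(x) x[2], character(1))  # first capture group
mhc <- vapply(m, function(x) x[3], character(1))  # second capture group

# Sort by maf, then mhc, mirroring arrange(maf, mhc)
sorted_small <- filenames_small[order(maf, mhc)]
print(sorted_small)
```

Since the S substring is not captured, it does not act as a sort key here; rows with equal maf and mhc keep their input order.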

Answer 2: (score: 0)

This result matches your desired output, but it only sorts by MAF and S. I don't understand how the MHC string should be used for sorting; if this answer doesn't meet your needs, please elaborate on that part.

library(stringr)
maf <- str_extract(filenames, "MAF\\d+\\.")
mhc <- str_extract(filenames, "\\..*MHC.*\\.")
s <- str_extract(filenames, "S\\d+$")

library(magrittr)
library(dplyr)

data.frame(filenames, maf, mhc, s) %>% 
  arrange(maf, s) %>% 
  select(filenames)

where filenames is the vector of file names.

Output is:

                       filenames
1          MAF001.incMHC.zPGS.S1
2          MAF001.incMHC.zPGS.S2
3           MAF001.noMHC.zPGS.S1
4           MAF001.noMHC.zPGS.S2
5  MAF001.noMHC_incRS148.zPGS.S1
6  MAF001.noMHC_incRS148.zPGS.S2
7          MAF002.incMHC.zPGS.S1
8          MAF002.incMHC.zPGS.S2
9           MAF002.noMHC.zPGS.S1
10          MAF002.noMHC.zPGS.S2
11 MAF002.noMHC_incRS148.zPGS.S1
12 MAF002.noMHC_incRS148.zPGS.S2
13         MAF003.incMHC.zPGS.S1
14         MAF003.incMHC.zPGS.S2
15          MAF003.noMHC.zPGS.S1
16          MAF003.noMHC.zPGS.S2
17 MAF003.noMHC_incRS148.zPGS.S1
18 MAF003.noMHC_incRS148.zPGS.S2

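Answer 3: (score: 0)

Many good solutions have already been added here. I'm adding another one that relies only on a vector.

Note: The OP intends to sort on the MAF, MHC and S substrings. I have stuck to that rule and sorted on all three. As a result, my answer's output may not match the other answers.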

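Approach:

  1. Use regmatches to find each substring described in the OP
  2. Use paste to build a string on which the sort can be performed
  3. Use setNames to set the names of the vector
  4. Sort the vector on its names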

    v[order(names(setNames(v, 
          paste(regmatches(v, regexpr("^MAF\\d+", v, perl = TRUE)),
                regmatches(v, regexpr("\\w*MHC\\w*", v, perl = TRUE)),
                regmatches(v, regexpr("\\w+\\d+$", v, perl = TRUE))
               ))))]
    #Result
    [1] "MAF001.incMHC.zPGS.S1"
    [2] "MAF001.incMHC.zPGS.S2"
    [3] "MAF001.noMHC.zPGS.S1"
    [4] "MAF001.noMHC.zPGS.S2"
    [5] "MAF001.noMHC_incRS148.zPGS.S1"
    [6] "MAF001.noMHC_incRS148.zPGS.S2"
    [7] "MAF002.incMHC.zPGS.S1"
    [8] "MAF002.incMHC.zPGS.S2"
    [9] "MAF002.noMHC.zPGS.S1"
    [10] "MAF002.noMHC.zPGS.S2"
    [11] "MAF002.noMHC_incRS148.zPGS.S1"
    [12] "MAF002.noMHC_incRS148.zPGS.S2"
    [13] "MAF003.incMHC.zPGS.S1"
    [14] "MAF003.incMHC.zPGS.S2"
    [15] "MAF003.noMHC.zPGS.S1"
    [16] "MAF003.noMHC.zPGS.S2"
    [17] "MAF003.noMHC_incRS148.zPGS.S1"
    [18] "MAF003.noMHC_incRS148.zPGS.S2"
    
Data:

    v <- c("MAF001.incMHC.zPGS.S1", "MAF001.noMHC_incRS148.zPGS.S1", "MAF001.noMHC.zPGS.S1", 
           "MAF001.incMHC.zPGS.S2", "MAF001.noMHC_incRS148.zPGS.S2", "MAF001.noMHC.zPGS.S2", 
           "MAF002.incMHC.zPGS.S1", "MAF002.noMHC_incRS148.zPGS.S1", "MAF002.noMHC.zPGS.S1", 
           "MAF002.incMHC.zPGS.S2", "MAF002.noMHC_incRS148.zPGS.S2", "MAF002.noMHC.zPGS.S2", 
           "MAF003.incMHC.zPGS.S1", "MAF003.noMHC_incRS148.zPGS.S1", "MAF003.noMHC.zPGS.S1", 
           "MAF003.incMHC.zPGS.S2", "MAF003.noMHC_incRS148.zPGS.S2", "MAF003.noMHC.zPGS.S2"
    )
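As a variation on the approach above, order() accepts several keys at once, so the three extracted substrings can be passed to it directly instead of being pasted into one name. The sketch below also converts the S number to an integer so that a value such as S10 sorts after S2; v_small (including its S10 entry) and the helper names are hypothetical and not from the answer, and the code assumes every element matches each pattern.

```r
# Hypothetical input; the S10 entry is invented to show numeric ordering
v_small <- c("MAF001.noMHC.zPGS.S10", "MAF001.noMHC.zPGS.S2",
             "MAF001.incMHC.zPGS.S2", "MAF002.incMHC.zPGS.S1")

# Same MAF and MHC patterns as in the answer
maf <- regmatches(v_small, regexpr("^MAF\\d+", v_small, perl = TRUE))
mhc <- regmatches(v_small, regexpr("\\w*MHC\\w*", v_small, perl = TRUE))

# Keep only the digits after the final ".S" and compare them as integers
s_num <- as.integer(sub("^.*\\.S(\\d+)$", "\\1", v_small))

# order() sorts by the first key, breaking ties with the following keys
sorted_v <- v_small[order(maf, mhc, s_num)]
print(sorted_v)
```

Passing the keys separately avoids depending on how the separator inside a pasted key compares with other characters, and it lets each key keep its own type.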
    

Answer 4: (score: 0)

I have a function designed specifically for tasks like this:

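Function:

library(magrittr)    # %>% and %<>%
library(purrr)       # map, map_chr, map_lgl
library(stringr)     # str_extract
library(data.table)  # as.data.table, .SD

reg_sort <- function(x,...,verbose=F) {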
    ellipsis <-   sapply(as.list(substitute(list(...)))[-1], deparse, simplify="array")
    reg_list <-   paste0(ellipsis, collapse=',')
    reg_list %<>% strsplit(",") %>% unlist %>% gsub("\\\\","\\",.,fixed=T)
    pattern  <-   reg_list %>% map_chr(~sub("^-\\\"","",.) %>% sub("\\\"$","",.) %>% sub("^\\\"","",.) %>% trimws)
    descInd  <-   reg_list %>% map_lgl(~grepl("^-\\\"",.)%>%as.logical)

    reg_extr <-   pattern %>% map(~str_extract(x,.)) %>% c(.,list(x)) %>% as.data.table
    reg_extr[] %<>% lapply(., function(x) type.convert(as.character(x), as.is = TRUE))

    map(rev(seq_along(pattern)),~{reg_extr<<-reg_extr[order(reg_extr[[.]],decreasing = descInd[.])]})

    if(verbose) { tmp<-lapply(reg_extr[,.SD,.SDcols=seq_along(pattern)],unique);names(tmp)<-pattern;tmp %>% print }

    return(reg_extr[[ncol(reg_extr)]])
}

Data:

vec <- c("MAF001.incMHC.zPGS.S1", "MAF002.incMHC.zPGS.S1", "MAF003.incMHC.zPGS.S1", 
  "MAF001.incMHC.zPGS.S2", "MAF002.incMHC.zPGS.S2", "MAF003.incMHC.zPGS.S2", 
  "MAF001.noMHC_incRS148.zPGS.S1", "MAF002.noMHC_incRS148.zPGS.S1", 
  "MAF003.noMHC_incRS148.zPGS.S1", "MAF001.noMHC_incRS148.zPGS.S2", 
  "MAF002.noMHC_incRS148.zPGS.S2", "MAF003.noMHC_incRS148.zPGS.S2", 
  "MAF001.noMHC.zPGS.S1", "MAF002.noMHC.zPGS.S1", "MAF003.noMHC.zPGS.S1", 
  "MAF001.noMHC.zPGS.S2", "MAF002.noMHC.zPGS.S2", "MAF003.noMHC.zPGS.S2"
)

Call:

reg_sort(x=vec, "^.*?(?=\\.)","(?<=\\.).*(?<=\\.S)","S\\d+$")

Result: (character vector)

1          MAF001.incMHC.zPGS.S1
2          MAF001.incMHC.zPGS.S2
3           MAF001.noMHC.zPGS.S1
4           MAF001.noMHC.zPGS.S2
5  MAF001.noMHC_incRS148.zPGS.S1
6  MAF001.noMHC_incRS148.zPGS.S2
7          MAF002.incMHC.zPGS.S1
8          MAF002.incMHC.zPGS.S2
9           MAF002.noMHC.zPGS.S1
10          MAF002.noMHC.zPGS.S2
11 MAF002.noMHC_incRS148.zPGS.S1
12 MAF002.noMHC_incRS148.zPGS.S2
13         MAF003.incMHC.zPGS.S1
14         MAF003.incMHC.zPGS.S2
15          MAF003.noMHC.zPGS.S1
16          MAF003.noMHC.zPGS.S2
17 MAF003.noMHC_incRS148.zPGS.S1
18 MAF003.noMHC_incRS148.zPGS.S2

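Other features include:

  • Descending sort (put a - in front of the pattern): reg_sort(x=vec, -"^.*?(?=\\.)","(?<=\\.).*(?<=\\.S)",-"S\\d+$")

  • Verbose mode: reg_sort(x=vec, "^.*?(?=\\.)","(?<=\\.).*(?<=\\.S)","S\\d+$",verbose=T) (lets you see and check what the regex patterns extracted for sorting)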
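For comparison, base R's order() can also mix ascending and descending keys: with method = "radix", the decreasing argument may be a vector with one value per key. This is a sketch under that assumption and is not a replacement for reg_sort; vec_small and the helper names are hypothetical.

```r
# Hypothetical four-element input, for illustration only
vec_small <- c("MAF001.incMHC.zPGS.S1", "MAF002.incMHC.zPGS.S2",
               "MAF001.incMHC.zPGS.S2", "MAF002.incMHC.zPGS.S1")

maf <- sub("\\..*$", "", vec_small)  # text before the first dot
s   <- sub("^.*\\.", "", vec_small)  # text after the last dot

# MAF descending, S ascending; a per-key `decreasing` needs method = "radix"
idx <- order(maf, s, decreasing = c(TRUE, FALSE), method = "radix")
sorted_vec <- vec_small[idx]
print(sorted_vec)
```

Note that the radix method compares character keys in the C locale, which may differ from the locale-aware collation used by the default method.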