Multistep cleaning / regex: replace and extract the part before a separator

Time: 2018-06-25 11:57:55

Tags: r regex grep paste gsub

I have a matrix M with the following row names:

S003_T1_p555
S003_T2_p456
S004_T3_p785
S004_T4_p426
SuperSMART_27_T1_p112
SuperSMART_27_T2_p414
SuperSMART_42_T3_p155
SuperSMART_42_T5_p775

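I want to write a function that

  1. replaces SuperSMART_ with S in this case
  2. then extracts only the characters before the first _ as a key and assigns a unique name to each set of rows that share a key

So S003_T1_p555 and S003_T2_p456 both become "group1", S004_T3_p785 and S004_T4_p426 become "group2", and so on.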

MWE

nms <- c("S003_T1_p555", "S003_T2_p456", "S004_T3_p785", "S004_T4_p426", 
    "SuperSMART_27_T1_p112", "SuperSMART_27_T2_p414", 
    "SuperSMART_42_T3_p155", "SuperSMART_42_T5_p775")

M <- matrix(
    seq_along(nms),
    dimnames = list(
        nms,
        'x'    
    )
)

1 answer:

Answer 0 (score: 4)

library(tidyverse)

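as.data.frame(M, stringsAsFactors = FALSE) %>%
    rownames_to_column('id') %>%    # move the row names into an 'id' column
    mutate(
        id = gsub('SuperSMART_', 'S', id),    # shorten the long prefix to 'S'
        id = gsub('(^S)(\\d{2})(_)', '\\10\\2\\3', id, perl = TRUE)    # pad two-digit keys with a zero
    ) %>%
    separate(id, into = c('S', 'R', 'p'), sep = '_', remove = FALSE) %>%    # split id on '_'
    mutate(., group = group_indices(., S))    # one integer per distinct key S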

##             id    S  R    p x group
## 1 S003_T1_p555 S003 T1 p555 1     1
## 2 S003_T2_p456 S003 T2 p456 2     1
## 3 S004_T3_p785 S004 T3 p785 3     2
## 4 S004_T4_p426 S004 T4 p426 4     2
## 5 S027_T1_p112 S027 T1 p112 5     3
## 6 S027_T2_p414 S027 T2 p414 6     3
## 7 S042_T3_p155 S042 T3 p155 7     4
## 8 S042_T5_p775 S042 T5 p775 8     4


## If you really want it as a function:
normalize_data <- function(m, ...) {
    as.data.frame(m, stringsAsFactors = FALSE) %>%
        tibble::rownames_to_column('id') %>%
        dplyr::mutate(
            id = gsub('SuperSMART_', 'S', id), 
            id = gsub('(^S)(\\d{2})(_)', '\\10\\2\\3', id, perl = TRUE) 
        ) %>%
        tidyr::separate(id, into = c('S', 'R', 'p'), sep = '_', remove = FALSE) %>%  
        dplyr::mutate(., group = dplyr::group_indices(., S))
}
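
For comparison, here is a minimal base-R sketch that returns the "group1", "group2", ... labels the question asks for instead of integer indices. The helper name label_groups is hypothetical and not part of the answer, and it numbers the groups in order of first appearance, whereas group_indices() numbers them in sorted key order (the two agree for this data).

```r
# Hypothetical helper (not from the answer): label row names by their key.
label_groups <- function(nms) {
    # Step 1: shorten the long prefix so every name starts with 'S'.
    ids <- sub('^SuperSMART_', 'S', nms)
    # Step 2: pad two-digit keys with a zero, as in the answer above.
    ids <- sub('^S(\\d{2})_', 'S0\\1_', ids)
    # Step 3: the key is everything before the first underscore.
    key <- sub('_.*$', '', ids)
    # Step 4: number the keys in order of first appearance.
    paste0('group', match(key, unique(key)))
}

nms <- c("S003_T1_p555", "S003_T2_p456", "S004_T3_p785", "S004_T4_p426",
    "SuperSMART_27_T1_p112", "SuperSMART_27_T2_p414",
    "SuperSMART_42_T3_p155", "SuperSMART_42_T5_p775")

groups <- label_groups(nms)
print(setNames(groups, nms))
```

Like the answer, this assumes the number after S has two or three digits.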

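This is group capturing, denoted by the parentheses in '(^S)(\\d{2})(_)'. Three groups are captured: 1: (^S), 2: (\d{2}), 3: (_). The first says to start at the beginning of the string (^) and grab an S. The second says that exactly two digits (\d{2}) come next, and the third says that an underscore must follow immediately.

So S27_T2_p414 would match this, whereas S004_T3_p785 would not.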
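
A quick way to check this claim is to test the pattern against both strings with grepl(). This is only an illustrative sketch; the variable names are made up for the example.

```r
# The pattern from the answer: 'S' at the start, exactly two digits, then '_'.
pattern <- '(^S)(\\d{2})(_)'
ids <- c('S27_T2_p414', 'S004_T3_p785')

# grepl() returns TRUE where the pattern matches and FALSE where it does not.
matches <- grepl(pattern, ids, perl = TRUE)
print(setNames(matches, ids))
```

The second string fails because the third character after S is another digit rather than the underscore the pattern requires.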

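As for the replacement '\\10\\2\\3': if the string matches '(^S)(\\d{2})(_)', we can use perl = TRUE to refer back to the captured groups (the parentheses above) in the replacement. \1 corresponds to (^S), \2 corresponds to (\d{2}), and \3 corresponds to (_). We can also insert content between the captured groups. In this case, I insert an extra zero between the first and second capture groups so that every number has 3 digits. This assumes that there are at most 3 digits after the S in the string.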
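
To see the replacement on its own, the sketch below applies the same gsub() call to a few sample ids. The ids are chosen for illustration; the last one is an assumed single-digit case that is not in the question's data.

```r
ids <- c('S27_T2_p414', 'S42_T3_p155', 'S004_T3_p785', 'S7_T1_p001')

# '\\1' is the leading S, the literal '0' is the inserted padding,
# '\\2' is the two digits and '\\3' is the underscore.
padded <- gsub('(^S)(\\d{2})(_)', '\\10\\2\\3', ids, perl = TRUE)
print(padded)
```

Only the two-digit ids change. The three-digit id and the single-digit id do not match the pattern, so gsub() returns them untouched, which means a one-digit number would need a separate padding rule.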