I have a matrix M whose row names are as follows:
S003_T1_p555
S003_T2_p456
S004_T3_p785
S004_T4_p426
SuperSMART_27_T1_p112
SuperSMART_27_T2_p414
SuperSMART_42_T3_p155
SuperSMART_42_T5_p775
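I would like to write a function that replaces SuperSMART_ with S, uses the characters before the first _ as a key, and assigns a unique name to each matching subject, so that S003_T1_p555 and S003_T2_p456 both become "group1", S004_T3_p785 and S004_T4_p426 become "group2", and so on.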
nms <- c("S003_T1_p555", "S003_T2_p456", "S004_T3_p785", "S004_T4_p426",
         "SuperSMART_27_T1_p112", "SuperSMART_27_T2_p414",
         "SuperSMART_42_T3_p155", "SuperSMART_42_T5_p775")
M <- matrix(
  seq_along(nms),
  dimnames = list(
    nms,
    'x'
  )
)
Answer 0 (score: 4)
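library(tidyverse)
as.data.frame(M, stringsAsFactors = FALSE) %>%
  rownames_to_column('id') %>%
  mutate(
    # collapse the long prefix, then pad two-digit numbers to three digits
    id = gsub('SuperSMART_', 'S', id),
    id = gsub('(^S)(\\d{2})(_)', '\\10\\2\\3', id, perl = TRUE)
  ) %>%
  separate(id, into = c('S', 'R', 'p'), sep = '_', remove = FALSE) %>%
  # group_indices(., S) is deprecated since dplyr 1.0.0; numbering the
  # sorted unique keys gives the same ids
  mutate(group = as.integer(factor(S)))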
##             id    S  R    p x group
## 1 S003_T1_p555 S003 T1 p555 1     1
## 2 S003_T2_p456 S003 T2 p456 2     1
## 3 S004_T3_p785 S004 T3 p785 3     2
## 4 S004_T4_p426 S004 T4 p426 4     2
## 5 S027_T1_p112 S027 T1 p112 5     3
## 6 S027_T2_p414 S027 T2 p414 6     3
## 7 S042_T3_p155 S042 T3 p155 7     4
## 8 S042_T5_p775 S042 T5 p775 8     4
## If you really want it as a function:
normalize_data <- function(m, ...) {
  as.data.frame(m, stringsAsFactors = FALSE) %>%
    tibble::rownames_to_column('id') %>%
    dplyr::mutate(
      id = gsub('SuperSMART_', 'S', id),
      id = gsub('(^S)(\\d{2})(_)', '\\10\\2\\3', id, perl = TRUE)
    ) %>%
    tidyr::separate(id, into = c('S', 'R', 'p'), sep = '_', remove = FALSE) %>%
    # group_indices(., S) is deprecated since dplyr 1.0.0; numbering the
    # sorted unique keys gives the same ids
    dplyr::mutate(group = as.integer(factor(S)))
}
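The question asked for labels such as "group1" rather than integer ids. As a minimal base R sketch of the same idea, assuming the key is everything before the first underscore once the prefix has been normalised, something like the following could work. The helper name group_labels is hypothetical and not part of the original answer; it numbers keys in order of first appearance, which for this data matches the sorted order used above.

```r
# Hypothetical helper (not from the original answer): map row names to
# "group1", "group2", ... using the text before the first underscore as key.
group_labels <- function(ids) {
  # Normalise the long prefix: "SuperSMART_27_T1_p112" -> "S27_T1_p112"
  ids <- sub("^SuperSMART_", "S", ids)
  # Pad two-digit numbers to three digits: "S27_..." -> "S027_..."
  ids <- sub("(^S)(\\d{2})(_)", "\\10\\2\\3", ids, perl = TRUE)
  # Key = everything before the first underscore
  key <- sub("_.*$", "", ids)
  # Number the keys in order of first appearance
  paste0("group", match(key, unique(key)))
}

nms <- c("S003_T1_p555", "S003_T2_p456", "S004_T3_p785", "S004_T4_p426",
         "SuperSMART_27_T1_p112", "SuperSMART_27_T2_p414",
         "SuperSMART_42_T3_p155", "SuperSMART_42_T5_p775")
lab <- group_labels(nms)
lab
## [1] "group1" "group1" "group2" "group2" "group3" "group3" "group4" "group4"
```

If the labels should replace the row names of M, note that row names are usually expected to be unique, so storing the labels in a separate vector or column is probably safer.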
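This is a grouped capture, indicated by the parentheses in '(^S)(\d{2})(_)' (the backslash is doubled when the pattern is written as an R string). Three groups are captured: 1: (^S), 2: (\d{2}), 3: (_). The first anchors the match to the start of the string (^) and captures S. The second says that exactly two digits come next (\\d{2}), and the third says that an underscore must follow immediately.
So S27_T2_p414 matches this pattern, whereas S004_T3_p785 does not.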
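The matching behaviour described above can be checked directly. This is a small sketch using base R only; the variable names pattern, ids, matched and groups are chosen here for illustration.

```r
pattern <- "(^S)(\\d{2})(_)"
ids <- c("S27_T2_p414", "S004_T3_p785")

# Does each id match? Only the two-digit form does, because in "S004_"
# the two digits "00" are followed by "3" rather than an underscore.
matched <- grepl(pattern, ids, perl = TRUE)
matched
## [1]  TRUE FALSE

# Full match followed by the three captured groups for the first id
groups <- regmatches(ids[1], regexec(pattern, ids[1], perl = TRUE))[[1]]
groups
## [1] "S27_" "S"    "27"   "_"
```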
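Now for the replacement, '\10\2\3'. If a string matches '(^S)(\d{2})(_)', we can use perl = TRUE to build the replacement from the captured groups (indicated by the parentheses above): \1 corresponds to (^S), \2 corresponds to (\d{2}), and \3 goes with (_). This lets us insert content between the capture groups. In this case, I insert an extra zero between the first and second capture groups so that every number has 3 digits. This assumes that the number following S has at most 3 digits.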
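To make the replacement concrete, here is a small sketch applying it to a few ids. The third id, with a one-digit number, is a made-up case added to show a limit of the pattern: it is left unpadded because \d{2} needs exactly two digits. The sprintf variant at the end is an assumed alternative, not the original answer's method, for padding any number of digits up to three.

```r
ids <- c("S27_T2_p414", "S004_T3_p785", "S7_T1_p001")

# "\\10" is backreference \1 followed by a literal 0 (R only supports \1 to \9)
padded <- gsub("(^S)(\\d{2})(_)", "\\10\\2\\3", ids, perl = TRUE)
padded
## [1] "S027_T2_p414" "S004_T3_p785" "S7_T1_p001"

# Assumed alternative: extract the number and zero-pad it with sprintf
num <- as.integer(sub("^S(\\d+)_.*$", "\\1", ids, perl = TRUE))
keys <- sprintf("S%03d", num)
keys
## [1] "S027" "S004" "S007"
```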