I have a data frame like this:
structure(list(mut = c("Q184H/CAA-CAT", "I219V/ATC-GTC", "A314T/GCG-ACG, P373Q/CCG-CAG, A653E/GCG-GAA","0")), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
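What I want to do is, for each comma-separated value in every row, grab everything after the "/" into a new column, regardless of how many entries there are in each row.
What I'd like to end up with: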
  mut                 nt
1 Q184H               CAA-CAT
2 I219V               ATC-GTC
3 A314T, P373Q, A653E GCG-ACG, CCG-CAG, GCG-GAA
4 0                   0
I tried using a regex for this, but I can't seem to get it to match every comma-separated entry.
library(dplyr)
df %>%
  mutate(nt = gsub(".+/(.*?)", "\\1", mut))
How can I get every entry to match? Do I have to split them apart and then match?
Answer 0 (score: 3)
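You just need to tweak your regex slightly; note how I changed your `.` to `[^,]`. In a regex, putting characters in square brackets with a leading `^` means "match anything but these characters". So `[^,]+` means match as many consecutive non-comma characters as possible.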
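To see the difference on the multi-entry row alone, here is a small base R sketch. It only reuses the two patterns from this thread; the variable names `x`, `greedy`, and `fixed` are mine, for illustration.

```r
# The multi-entry value from the question's data.
x <- "A314T/GCG-ACG, P373Q/CCG-CAG, A653E/GCG-GAA"

# Question's pattern: `.` also matches commas, so everything up to the
# last slash is swallowed and only the final entry survives.
greedy <- gsub(".+/(.*?)", "\\1", x)
print(greedy)
#> [1] "GCG-GAA"

# Answer's pattern: `[^,]` cannot cross a comma, so gsub() matches
# once per comma-separated entry.
fixed <- gsub("[^,]+?/([^,]+?)", "\\1", x)
print(fixed)
#> [1] "GCG-ACG,CCG-CAG,GCG-GAA"
```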
df = structure(list(mut = c("Q184H/CAA-CAT", "I219V/ATC-GTC",
"A314T/GCG-ACG, P373Q/CCG-CAG, A653E/GCG-GAA","0")),
row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df %>%
  mutate(nt = gsub("[^,]+?/([^,]+?)", "\\1", mut),
         mut = gsub("([^/]+)/[^,]+", "\\1", mut))
#> # A tibble: 4 x 2
#>   mut                 nt
#>   <chr>               <chr>
#> 1 Q184H               CAA-CAT
#> 2 I219V               ATC-GTC
#> 3 A314T, P373Q, A653E GCG-ACG,CCG-CAG,GCG-GAA
#> 4 0                   0
Created on 2018-10-10 by the reprex package (v0.2.1)
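One detail in the output above: the `nt` value in row 3 has no space after each comma, because `[^,]+?` also consumes the space that starts each later entry. If the ", " spacing from the question's desired output matters, one possible variant is to exclude spaces from the character class as well and simply delete the unwanted part. This is my own sketch, not part of the answer above; `nt` and `labels` are illustrative names, and the same `gsub()` calls could be placed inside `mutate()`.

```r
mut <- c("Q184H/CAA-CAT", "I219V/ATC-GTC",
         "A314T/GCG-ACG, P373Q/CCG-CAG, A653E/GCG-GAA", "0")

# Delete each "label/" prefix. Excluding the space (and comma) from the
# class leaves the ", " separators untouched.
nt <- gsub("[^,/ ]+/", "", mut)

# Delete each "/codons" suffix, which leaves only the mutation labels.
labels <- gsub("/[^, ]+", "", mut)

print(data.frame(mut = labels, nt = nt))
```

Rows without a slash, such as "0", contain nothing for either pattern to match and are returned unchanged.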
Answer 1 (score: 1)
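Not posting this as an answer (@duckmayr did the regex debugging). Posting this solely to show folks that, by using stringi, we can get self-documenting regexes so our future selves won't hate our past selves: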
library(stringi) # it's what stringr uses
library(tidyverse)
xdf <- structure(list(mut = c("Q184H/CAA-CAT", "I219V/ATC-GTC", "A314T/GCG-ACG, P373Q/CCG-CAG, A653E/GCG-GAA","0")), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
mutate(
  xdf,
  nt = stri_replace_all_regex(
    str = mut,
    pattern = "
      [^,]+?    # one or more non-comma characters, as few as possible
      /         # followed by a forward slash
      (         # start of match group
        [^,]+?  # same as above
      )         # end of match group
    ",
    replacement = "$1", # take the match group value as the value
    opts_regex = stri_opts_regex(comments=TRUE)
  ),
  mut = stri_replace_all_regex(
    str = mut,
    pattern = "
      (         # start of match group
        [^/]+   # match anything but a forward slash
      )         # end of match group
      /         # followed by a forward slash
      [^,]+     # match anything but a comma
    ",
    replacement = "$1", # take the match group value as the value
    opts_regex = stri_opts_regex(comments=TRUE)
  )
)
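As for the question of whether the entries have to be split first: the regex answers above show that it is not required, but splitting is a workable alternative if the per-entry logic ever gets more complicated. A minimal base R sketch, not taken from either answer; it assumes entries are always separated by exactly ", ", and the names `entries`, `nt`, and `labels` are just for illustration.

```r
mut <- c("Q184H/CAA-CAT", "I219V/ATC-GTC",
         "A314T/GCG-ACG, P373Q/CCG-CAG, A653E/GCG-GAA", "0")

# Split every row into its comma-separated entries (a list of vectors).
entries <- strsplit(mut, ", ", fixed = TRUE)

# For each row, keep the part after the slash and glue the pieces back.
nt <- vapply(entries, function(e) {
  paste(sub("^[^/]*/", "", e), collapse = ", ")
}, character(1))

# For each row, keep the part before the slash instead.
labels <- vapply(entries, function(e) {
  paste(sub("/.*$", "", e), collapse = ", ")
}, character(1))

print(data.frame(mut = labels, nt = nt))
```

Because each entry is handled on its own, the patterns can be anchored with `^` and `$`, which makes them easier to read than the one-pass versions.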