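I need help with a regular expression that extracts the third element of a string delimited by underscores. The number of underscores varies. I can do this with str_split, but is there a way to get the same result with str_replace, as in the example below?
(The desired result is x = AAAA, BBBB, CCCC, DDDD. If possible, please keep the grouping with ().)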
library(tidyverse)
library(stringr)
d <- enframe(c("asfe_01_AAAA_fses_feee",
"asfe_87_BBBB_fses_feee",
"99_fesf_CCCC_feee",
"99_fesf_DDDD"),
name = NULL, value = "txt")
d %>%
mutate(x = str_replace(txt, "(.+)_(.+)_(.+)_*(.*)_*(.*)", "\\3"),
want_strsplit = str_split(txt, "_", simplify = TRUE)[, 3])
#txt x want_strsplit
# <chr> <chr> <chr>
#1 asfe_01_AAAA_fses_feee feee AAAA
#2 asfe_87_BBBB_fses_feee feee BBBB
#3 99_fesf_CCCC_feee feee CCCC
#4 99_fesf_DDDD DDDD DDDD
Answer 0 (score: 5)
You could also use strsplit.
mapply(`[`, strsplit(d$txt, "_"), 3)
# [1] "AAAA" "BBBB" "CCCC" "DDDD"
The whole thing:
splt <- strsplit(d$txt, "_")
cbind(d, x=mapply(`[`, splt, lengths(splt)), want_strsplit=mapply(`[`, splt, 3))
# txt x want_strsplit
# 1 asfe_01_AAAA_fses_feee feee AAAA
# 2 asfe_87_BBBB_fses_feee feee BBBB
# 3 99_fesf_CCCC_feee feee CCCC
# 4 99_fesf_DDDD DDDD DDDD
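As a variation on the same split-and-index idea, here is a minimal base R sketch that uses vapply instead of mapply. The names txt, parts and third are chosen for illustration only; fixed = TRUE treats "_" as a literal string rather than a regular expression.

```r
# Sample strings from the question
txt <- c("asfe_01_AAAA_fses_feee", "asfe_87_BBBB_fses_feee",
         "99_fesf_CCCC_feee", "99_fesf_DDDD")

# Split each string on the literal underscore
parts <- strsplit(txt, "_", fixed = TRUE)

# Take the third piece of each split; vapply enforces one string per element
# (a string with fewer than three pieces yields NA)
third <- vapply(parts, function(p) p[3], character(1))
third
```

vapply declares the expected return type up front, so a malformed split raises an error instead of silently returning a list.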
Answer 1 (score: 3)
Using str_replace:
> d%>%mutate(x=str_replace(txt,"^((?:[^_]*_){2})([a-zA-Z]+).*","\\2"))
# A tibble: 4 x 2
txt x
<chr> <chr>
1 asfe_01_AAAA_fses_feee AAAA
2 asfe_87_BBBB_fses_feee BBBB
3 99_fesf_CCCC_feee CCCC
4 99_fesf_DDDD DDDD
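The first group captures everything up to and including the first two occurrences of _. The second group captures the run of letters that immediately follows the first group, and the trailing .* consumes the rest of the string so that only the second group is kept.
If the third element can also contain digits, use [[:alnum:]] instead: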
d%>%mutate(x=str_replace(txt,"^((?:[^_]*_){2})([[:alnum:]]+).*","\\2"))
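To see what each capture group holds, the following base R sketch applies the same pattern with regexec and regmatches. The variable names and column labels are illustrative assumptions; perl = TRUE is passed so that the non-capturing group (?:...) is handled by PCRE.

```r
# Sample strings from the question
txt <- c("asfe_01_AAAA_fses_feee", "asfe_87_BBBB_fses_feee",
         "99_fesf_CCCC_feee", "99_fesf_DDDD")

pattern <- "^((?:[^_]*_){2})([[:alnum:]]+).*"

# regexec() reports the full match followed by each capture group
m <- regmatches(txt, regexec(pattern, txt, perl = TRUE))

# One row per input string: full match, group 1, group 2
groups <- do.call(rbind, m)
colnames(groups) <- c("match", "group1", "group2")
groups
```

For the first string, group1 is "asfe_01_" and group2 is "AAAA", which is why the replacement "\\2" returns only the third element.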
Answer 2 (score: 3)
With sub:
sub("^(([^_]+_){2})([^_]+).*", "\\3", d$txt)
#[1] "AAAA" "BBBB" "CCCC" "DDDD"
Answer 3 (score: 1)
d %>%
mutate(x = str_replace(txt, "^([^_]+)_([^_]+)_([^_]+).*", "\\3"))
[^_] matches any character except _.
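The difference between the pattern in the question and one built on the negated class can be checked with a small base R sketch. It uses sub with perl = TRUE rather than str_replace so that it runs without extra packages; the names greedy and anchored are illustrative.

```r
txt <- c("asfe_01_AAAA_fses_feee", "99_fesf_DDDD")

# Greedy `.+` can cross underscores, so the first group takes as much as it can
# and the third group ends up holding the last field
greedy <- sub("(.+)_(.+)_(.+)_*(.*)_*(.*)", "\\3", txt, perl = TRUE)
greedy
# [1] "feee" "DDDD"

# `[^_]+` stops at every underscore, so groups 1 to 3 line up with fields 1 to 3
anchored <- sub("^([^_]+)_([^_]+)_([^_]+).*", "\\3", txt, perl = TRUE)
anchored
# [1] "AAAA" "DDDD"
```

The two patterns agree only when the string has exactly three fields, which matches the x column shown in the question.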