I am trying to extract some variable names and numbers from the following vector and store them in two new variables:
unique_strings <- c("PM_1_PMS5003_S_Avg", "PM_2_5_PMS5003_S_Avg", "PM_10_PMS5003_S_Avg",
"PM_1_PMS5003_A_Avg", "PM_2_5_PMS5003_A_Avg", "PM_10_PMS5003_A_Avg",
"PNC_0_3_PMS5003_Avg", "PNC_0_5_PMS5003_Avg", "PNC_1_0_PMS5003_Avg",
"PNC_2_5_PMS5003_Avg", "PNC_5_0_PMS5003_Avg", "PNC_10_0_PMS5003_Avg",
"PM_1_PMS7003_S_Avg", "PM_2_5_PMS7003_S_Avg", "PM_10_PMS7003_S_Avg",
"PM_1_PMS7003_A_Avg", "PM_2_5_PMS7003_A_Avg", "PM_10_PMS7003_A_Avg",
"PNC_0_3_PMS7003_Avg", "PNC_0_5_PMS7003_Avg", "PNC_1_0_PMS7003_Avg",
"PNC_2_5_PMS7003_Avg", "PNC_5_0_PMS7003_Avg", "PNC_10_0_PMS7003_Avg"
)
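For the first variable, I want to extract every character before PMS. This includes the leading PM or PNC together with the underscores and digits that follow it. I want to store these results in a variable named pollutant.
Desired output: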
unique(pollutant)
[1] "PM_1" "PM_2_5" "PM_10" "PNC_0_3" "PNC_0_5" "PNC_1_0" "PNC_2_5" "PNC_5_0" "PNC_10"
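For the second variable, I want to extract everything after PMS.
To do this, I first tried extracting only the model number (the four-digit number ending in 003) from each string; however, it would also be useful to include A_Avg or S_Avg in the extraction.
Here is my first attempt:
I haven't used regular expressions before and am having a hard time navigating the existing documentation and Stack posts. Thanks for your input!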
Answer 0 (score: 2)
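We can use str_split to split each string on "PMS". After that, use str_replace to remove the trailing "_" from the first column. The output is m, a character matrix: the first variable is in the first column and the second variable is in the second column.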
library(stringr)
m <- str_split(unique_strings, pattern = "PMS", simplify = TRUE)
m[, 1] <- str_replace(m[, 1], "_$", "")
m
# [,1] [,2]
# [1,] "PM_1" "5003_S_Avg"
# [2,] "PM_2_5" "5003_S_Avg"
# [3,] "PM_10" "5003_S_Avg"
# [4,] "PM_1" "5003_A_Avg"
# [5,] "PM_2_5" "5003_A_Avg"
# [6,] "PM_10" "5003_A_Avg"
# [7,] "PNC_0_3" "5003_Avg"
# [8,] "PNC_0_5" "5003_Avg"
# [9,] "PNC_1_0" "5003_Avg"
# [10,] "PNC_2_5" "5003_Avg"
# [11,] "PNC_5_0" "5003_Avg"
# [12,] "PNC_10_0" "5003_Avg"
# [13,] "PM_1" "7003_S_Avg"
# [14,] "PM_2_5" "7003_S_Avg"
# [15,] "PM_10" "7003_S_Avg"
# [16,] "PM_1" "7003_A_Avg"
# [17,] "PM_2_5" "7003_A_Avg"
# [18,] "PM_10" "7003_A_Avg"
# [19,] "PNC_0_3" "7003_Avg"
# [20,] "PNC_0_5" "7003_Avg"
# [21,] "PNC_1_0" "7003_Avg"
# [22,] "PNC_2_5" "7003_Avg"
# [23,] "PNC_5_0" "7003_Avg"
# [24,] "PNC_10_0" "7003_Avg"
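If the two pieces are wanted as separate vectors rather than as columns of a matrix, a base-R sketch along the following lines might work. It swaps in sub for the stringr calls above, the name sensor is only a placeholder chosen here, and the shortened input vector x is for illustration.

```r
# Illustrative subset of unique_strings (assumption: same naming scheme)
x <- c("PM_2_5_PMS5003_S_Avg", "PNC_10_0_PMS7003_Avg")

# Drop "_PMS" and everything after it to keep the pollutant part
pollutant <- sub("_PMS.*$", "", x)

# Drop everything up to and including "PMS" to keep the remainder
# ("sensor" is a hypothetical name, not from the question)
sensor <- sub("^.*PMS", "", x)

pollutant
sensor
```

Because both calls are vectorised, the same two lines should apply unchanged to the full unique_strings vector.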
Answer 1 (score: 1)
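We can use str_extract to match 'PM' or 'PNC' at the start of the string (^(PM|PNC)), followed by _ and one or more digits (\\d+), optionally followed by further groups of _ and a digit (zero or more of them, hence (_\\d)*).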
library(stringr)
out <- str_extract(unique_strings, "^(PM|PNC)_\\d+(_\\d)*")
This returns NA for any element that has no match. If we need to drop those:
na.omit(out)
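To see where the NA values come from, here is a small base-R sketch of the same idea using regexpr and regmatches instead of stringr. The input "Temp_Avg" is a made-up string added only to show a non-match, and [0-9] stands in for \\d.

```r
# Two strings from the question plus one hypothetical non-matching name
x <- c("PM_2_5_PMS5003_S_Avg", "PNC_10_0_PMS7003_Avg", "Temp_Avg")

# Same pattern as above, written with [0-9] in place of \\d
pat <- "^(PM|PNC)_[0-9]+(_[0-9])*"

# regexpr returns -1 where the pattern does not match
m <- regexpr(pat, x)

# Start from all NA, then fill in only the elements that matched
out <- rep(NA_character_, length(x))
out[m != -1] <- regmatches(x, m)

out                          # third element stays NA
kept <- as.character(na.omit(out))
kept
```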
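For the second case, the desired output is not clear. If we need to extract everything after PMS, we can use a regex lookbehind ((?<=PMS)) and match all the characters that follow (.*).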
str_extract(unique_strings, "(?<=PMS).*")
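Since the question mentions both the four-digit model number and the A_Avg / S_Avg part, the text after PMS could be split further. This is only a sketch in base R, assuming every remainder starts with exactly four digits followed by an underscore; the names rest, model and suffix are placeholders.

```r
# Example remainders, as produced by extracting everything after "PMS"
rest <- c("5003_S_Avg", "5003_A_Avg", "7003_Avg")

# Keep only the leading four digits (the model number)
model <- sub("^([0-9]{4}).*$", "\\1", rest)

# Remove the four digits and the underscore to keep the trailing label
suffix <- sub("^[0-9]{4}_", "", rest)

model
suffix
```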