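I have a long string, like the example below, and I am struggling to find a regular expression that splits it into parts based on a pattern such as '1. OAS / AC' and '2. OAS / AD'.
These markers have:
1) a number at the start that differs each time
2) two uppercase letters, from A to Z
I tried: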
x <- stringr::str_split(have, "([1-9])( OAS / )([A-Z]{2})")
but it does not work.
Thanks in advance for your help!
Example
require(stringr)
have <- "1. OAS / AC 12345/this is a test string to regex, 2. OAS / AD 79856/this is another test string to regex, 3. OAS / AE 87987/this is a new test string to regex. 4. OAS / AZ 78798456/this is one mode test string to regex."
want <- stringr::str_split(have, "([1-9])( OAS / )([A-Z]{2})")
want <- list(
"1. OAS / AC " = "12345/this is a test string to regex,",
"2. OAS / AD " = "79856/this is another test string to regex,",
"3. OAS / AE " = "87987/this is a new test string to regex.",
"4. OAS / AZ " = "78798456/this is one mode test string to regex."
)
Answer 0: (score: 1)
We can split on a positive lookahead that looks for one or more digits followed by a period:
str_split(have, "(?=\\d+\\.)")
[1] "" "1. OAS / AC 12345/this is a test string to regex, "
[3] "2. OAS / AD 79856/this is another test string to regex, " "3. OAS / AE 87987/this is a new test string to regex. "
[5] "4. OAS / AZ 78798456/this is one mode test string to regex."
We can clean it up further by dropping the empty first element:
str_split(have, "(?=\\d{1,2}\\.)") %>% unlist() %>% .[-1]
[1] "1. OAS / AC 12345/this is a test string to regex, " "2. OAS / AD 79856/this is another test string to regex, "
[3] "3. OAS / AE 87987/this is a new test string to regex. " "4. OAS / AZ 78798456/this is one mode test string to regex."
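To get from these pieces to the named list shown in the question, each piece still has to be separated into its label and the text that follows. A minimal sketch in base R, assuming every piece starts with a label of the form "digits, dot, OAS / two uppercase letters"; the names `pieces`, `label_pattern`, `labels`, and `values` are illustrative, and the input is shortened:

```r
# Illustrative input: pieces like the ones produced by the split above (shortened).
pieces <- c("1. OAS / AC 12345/first test string, ",
            "2. OAS / AD 79856/second test string.")

# Assumed label format: digits, a dot, " OAS / ", two uppercase letters.
label_pattern <- "^\\d+\\. OAS / [A-Z]{2}"

# Pull out the label at the start of each piece.
labels <- regmatches(pieces, regexpr(label_pattern, pieces, perl = TRUE))

# Remove the label and trim the surrounding whitespace to get the value.
values <- trimws(sub(label_pattern, "", pieces, perl = TRUE))

# Named list: label -> remaining text.
want <- as.list(setNames(values, labels))
```

Note that `regmatches()` silently drops pieces that do not match, so this only lines up when every piece begins with a label.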
Answer 1: (score: 0)
You can use
library(stringr)
have <- "1. OAS / AC 12345/this is a test string to regex, 2. OAS / AD 79856/this is another test string to regex, 3. OAS / AE 87987/this is a new test string to regex. 4. OAS / AZ 78798456/this is one mode test string to regex."
r <- stringr::str_match_all(have, "(\\d+\\. OAS / [A-Z]{2})\\s*(.*?)(?=\\s*\\d+\\. OAS / [A-Z]{2}|\\z)")
res <- r[[1]][,3]
names(res) <- r[[1]][,2]
Result:
dput(res)
# => structure(c("12345/this is a test string to regex,", "79856/this is another test string to regex,",
# "87987/this is a new test string to regex.", "78798456/this is one mode test string to regex."
# ), .Names = c("1. OAS / AC", "2. OAS / AD", "3. OAS / AE", "4. OAS / AZ"
# ))
See the regex demo.
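Pattern details
- (\d+\. OAS / [A-Z]{2}) - capturing group 1:
  - \d+ - one or more digits
  - \. - a dot
  - " OAS / " - the literal substring " OAS / "
  - [A-Z]{2} - two uppercase letters
- \s* - zero or more whitespace characters
- (.*?) - capturing group 2: zero or more characters other than line breaks, as few as possible
- (?=\s*\d+\. OAS / [A-Z]{2}|\z) - a positive lookahead: immediately to the right of the current position there must be either
  - \s*\d+\. OAS / [A-Z]{2} - zero or more whitespace characters, one or more digits, a dot, a space, OAS, a space, /, a space, and two uppercase letters
  - | - or
  - \z - the end of the string.
Answer 2: (score: 0)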
The way you describe the problem is not clear, but if you want to extract everything up to "OAS / AC",
library(qdap)
beg2char(have, " ", 4)#looks for the fourth occurrence of \\s and extracts everything before it.
For the above function to work, the sentences should be individual strings in a character vector.
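If installing qdap is not an option, the same "everything before the fourth space" idea can be approximated in base R. This is only a sketch under the assumption that the label always consists of exactly four whitespace-separated tokens; `x` and `first_four` are illustrative names:

```r
# Illustrative single string (one element of a character vector).
x <- "1. OAS / AC 12345/this is a test string to regex,"

# Keep the first four whitespace-separated tokens and drop the rest:
# three "token + whitespace" groups, then a fourth token.
first_four <- sub("^((?:\\S+\\s+){3}\\S+).*$", "\\1", x, perl = TRUE)
```

Because `sub()` is vectorised, the same call works on a character vector with one sentence per element.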
If your aim is actually to insert a "=" sign between the two-letter substring after "OAS" and the digits that follow it,
gsub("([A-Z])\\s*([0-9])", "\\1 = \\2", have, perl = TRUE)
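To see what this substitution does, here is a small sketch on a shortened string (the variable names `x` and `out` are illustrative). The pattern matches an uppercase letter, optional whitespace, and a digit, and puts " = " between the letter and the digit:

```r
# Shortened illustrative input with a single label.
x <- "1. OAS / AC 12345/this is a test string"

# "C 1" is the only place where an uppercase letter is followed
# (after optional whitespace) by a digit, so only that spot changes.
out <- gsub("([A-Z])\\s*([0-9])", "\\1 = \\2", x, perl = TRUE)
print(out)
# [1] "1. OAS / AC = 12345/this is a test string"
```

The pattern is not tied to "OAS", so any uppercase letter followed by a digit elsewhere in the text would be rewritten as well.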