Question

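I have a very long string, like the example below, and I am struggling to find a regular expression that splits it into parts based on a pattern such as "1. OAS / AC" and "2. OAS / AD".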

This part of the text has:

1) a different number at the start

2) two uppercase letters, from A to Z

I tried:

x <- stringr::str_split(have, "([1-9])( OAS / )([A-Z]{2})")

but it does not work.

Thanks in advance for your help!

Example

require(stringr)
have <- "1. OAS / AC 12345/this is a test string to regex, 2. OAS / AD     79856/this is another test string to regex, 3. OAS / AE 87987/this is a new test string to regex. 4. OAS / AZ 78798456/this is one mode test string to regex."
want <- stringr::str_split(have, "([1-9])( OAS / )([A-Z]{2})")

want <- list(
         "1. OAS / AC " = "12345/this is a test string to regex,",
         "2. OAS / AD " = "79856/this is another test string to regex,",
         "3. OAS / AE " = "87987/this is a new test string to regex.",
         "4. OAS / AZ " = "78798456/this is one mode test string to regex."
)

Answer 1

We can use a positive lookahead to split at each position where one or more digits are followed by a period:

str_split(have, "(?=\\d+\\.)")

[1] ""                                                             "1. OAS / AC 12345/this is a test string to regex, "          
[3] "2. OAS / AD     79856/this is another test string to regex, " "3. OAS / AE 87987/this is a new test string to regex. "      
[5] "4. OAS / AZ 78798456/this is one mode test string to regex."

We can clean it up further by dropping the empty first element:

str_split(have, "(?=\\d{1,2}\\.)") %>% unlist() %>% .[-1]

[1] "1. OAS / AC 12345/this is a test string to regex, "           "2. OAS / AD     79856/this is another test string to regex, "
[3] "3. OAS / AE 87987/this is a new test string to regex. "       "4. OAS / AZ 78798456/this is one mode test string to regex."
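One caveat: a lookahead for digits plus a period also matches in the middle of a multi-digit number (before the "0" in "10.") and before any other number followed by a period in the text. A minimal sketch of a stricter split, assuming every section starts with the full "N. OAS / XX" marker; the name `parts` is only illustrative:

```r
library(stringr)

have <- "1. OAS / AC 12345/this is a test string to regex, 2. OAS / AD     79856/this is another test string to regex, 3. OAS / AE 87987/this is a new test string to regex. 4. OAS / AZ 78798456/this is one mode test string to regex."

# Split only where a complete "N. OAS / XX" marker begins.
# The lookbehind (?<!\\d) prevents a split inside a multi-digit number.
parts <- str_split(have, "(?<!\\d)(?=\\d+\\. OAS / [A-Z]{2})")[[1]]

# Drop the empty leading piece and trim surrounding whitespace.
parts <- str_trim(parts[parts != ""])
```

Anchoring on the whole marker costs a longer pattern, but it stops stray numbers in the free text from creating extra pieces.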

Answer 2

You can use

library(stringr)
have <- "1. OAS / AC 12345/this is a test string to regex, 2. OAS / AD     79856/this is another test string to regex, 3. OAS / AE 87987/this is a new test string to regex. 4. OAS / AZ 78798456/this is one mode test string to regex."
r <- stringr::str_match_all(have, "(\\d+\\. OAS / [A-Z]{2})\\s*(.*?)(?=\\s*\\d+\\. OAS / [A-Z]{2}|\\z)")
res <- r[[1]][,3]
names(res) <- r[[1]][,2]

Result:

dput(res)
# => structure(c("12345/this is a test string to regex,", "79856/this is another test string to regex,", 
#  "87987/this is a new test string to regex.", "78798456/this is one mode test string to regex."
#  ), .Names = c("1. OAS / AC", "2. OAS / AD", "3. OAS / AE", "4. OAS / AZ"
#  ))

See the regex demo.

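Pattern details

(\d+\. OAS / [A-Z]{2}) - Capturing group 1:
- \d+ - 1 or more digits
- \. - a literal .
- " OAS / " - the literal substring " OAS / ", including its surrounding spaces
- [A-Z]{2} - two uppercase letters
\s* - 0 or more whitespace characters
(.*?) - Capturing group 2: any 0 or more characters other than line break characters, as few as possible
(?=\s*\d+\. OAS / [A-Z]{2}|\z) - a positive lookahead: immediately to the right of the current position, there must be
- \s*\d+\. OAS / [A-Z]{2} - 0 or more whitespace characters, 1 or more digits, ., the literal " OAS / ", and two uppercase letters
- | - or
- \z - the very end of the string.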
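If a named list like the `want` object in the question is needed rather than a named character vector, the match matrix can be converted directly. This is a minimal sketch using the same pattern; the variable names `pattern` and `m` are illustrative, and note that the names come out without the trailing space shown in the question:

```r
library(stringr)

have <- "1. OAS / AC 12345/this is a test string to regex, 2. OAS / AD     79856/this is another test string to regex, 3. OAS / AE 87987/this is a new test string to regex. 4. OAS / AZ 78798456/this is one mode test string to regex."

pattern <- "(\\d+\\. OAS / [A-Z]{2})\\s*(.*?)(?=\\s*\\d+\\. OAS / [A-Z]{2}|\\z)"

# One row per match: column 1 is the full match,
# column 2 is the marker, column 3 is the text after it.
m <- str_match_all(have, pattern)[[1]]

# Name the text by its marker, then turn the vector into a list.
want <- as.list(setNames(m[, 3], m[, 2]))
```

Individual sections can then be looked up by marker, for example `want[["2. OAS / AD"]]`.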

Answer 3

The way you have described the problem is not entirely clear, but if you want to extract everything up to "OAS / AC",

library(qdap)
beg2char(have, " ", 4)#looks for the fourth occurrence of \\s and extracts everything before it.

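For the function above to work, the sentences should be individual strings in a character vector.

If your intention is actually to insert an "=" sign between the two-letter substring that follows "OAS" and the number after it,

gsub("([A-Z])\\s*([0-9])", "\\1 = \\2", have, perl = TRUE)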
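To make the effect of that substitution concrete, here is a small sketch on a shortened copy of the sample string; `x` and `out` are illustrative names:

```r
# Shortened version of the sample text from the question.
x <- "1. OAS / AC 12345/this is a test string to regex, 2. OAS / AD     79856/this is another test string to regex,"

# Replace the whitespace between an uppercase letter and a digit with " = ".
out <- gsub("([A-Z])\\s*([0-9])", "\\1 = \\2", x, perl = TRUE)

print(out)
# "1. OAS / AC = 12345/this is a test string to regex, 2. OAS / AD = 79856/this is another test string to regex,"
```

The pattern matches any uppercase letter followed by optional whitespace and a digit, so it relies on no other such pair occurring in the text.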

Regular expression for splitting a text string in R
