Regular expression to split a text string in R

Time: 2019-02-12 00:10:53

Tags: r regex

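I have a long string, like the example below, and I am struggling to find a regular expression that splits it into parts based on a pattern such as "1. OAS / AC" and "2. OAS / AD".

This part of the text has:

1) a number at the start that differs each time

2) two uppercase letters, from A to Z

I tried: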

x <- stringr::str_split(have, "([1-9])( OAS / )([A-Z]{2})")

but it does not work.

Thanks in advance for any help!

Example

require(stringr)
have <- "1. OAS / AC 12345/this is a test string to regex, 2. OAS / AD     79856/this is another test string to regex, 3. OAS / AE 87987/this is a new test string to regex. 4. OAS / AZ 78798456/this is one mode test string to regex."
want <- stringr::str_split(have, "([1-9])( OAS / )([A-Z]{2})")

want <- list(
         "1. OAS / AC " = "12345/this is a test string to regex,",
         "2. OAS / AD " = "79856/this is another test string to regex,",
         "3. OAS / AE " = "87987/this is a new test string to regex.",
         "4. OAS / AZ " = "78798456/this is one mode test string to regex."
)

3 answers:

Answer 0: (score: 1)

We can split with a positive lookahead that looks for one or more digits followed by a period:

str_split(have, "(?=\\d+\\.)")

[1] ""                                                             "1. OAS / AC 12345/this is a test string to regex, "          
[3] "2. OAS / AD     79856/this is another test string to regex, " "3. OAS / AE 87987/this is a new test string to regex. "      
[5] "4. OAS / AZ 78798456/this is one mode test string to regex."

We can clean this up further by dropping the empty first element:

str_split(have, "(?=\\d{1,2}\\.)") %>% unlist() %>% .[-1]

[1] "1. OAS / AC 12345/this is a test string to regex, "           "2. OAS / AD     79856/this is another test string to regex, "
[3] "3. OAS / AE 87987/this is a new test string to regex. "       "4. OAS / AZ 78798456/this is one mode test string to regex." 
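
To get the named list the question asks for, one possible route is to split on the whole header instead of splitting in front of it. Note that the pattern in the question never matches, because it expects the digit to be followed directly by a space, while the text has a period in between ("1. OAS"). The sketch below uses a shortened sample string; `header`, `keys` and `vals` are illustrative names that do not appear in the original posts.

```r
library(stringr)

# Shortened sample in the same shape as the string in the question.
have <- "1. OAS / AC 12345/first text, 2. OAS / AD     79856/second text. 3. OAS / AE 87987/third text."

# Header pattern: digits, a period, " OAS / ", two uppercase letters.
header <- "\\d+\\. OAS / [A-Z]{2}"

# The headers become the names ...
keys <- str_extract_all(have, header)[[1]]

# ... and the text between headers becomes the values.
# The string starts with a header, so the first piece is empty and is dropped.
vals <- str_split(have, header)[[1]][-1]
vals <- str_trim(vals)

want <- as.list(setNames(vals, keys))
```

Unlike `want` in the question, the names here carry no trailing space; append one with `paste0(keys, " ")` if that matters.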

Answer 1: (score: 0)

You can use

library(stringr)
have <- "1. OAS / AC 12345/this is a test string to regex, 2. OAS / AD     79856/this is another test string to regex, 3. OAS / AE 87987/this is a new test string to regex. 4. OAS / AZ 78798456/this is one mode test string to regex."
r <- stringr::str_match_all(have, "(\\d+\\. OAS / [A-Z]{2})\\s*(.*?)(?=\\s*\\d+\\. OAS / [A-Z]{2}|\\z)")
res <- r[[1]][,3]
names(res) <- r[[1]][,2]

Result:

dput(res)
# => structure(c("12345/this is a test string to regex,", "79856/this is another test string to regex,", 
#  "87987/this is a new test string to regex.", "78798456/this is one mode test string to regex."
#  ), .Names = c("1. OAS / AC", "2. OAS / AD", "3. OAS / AE", "4. OAS / AZ"
#  ))

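See the regex demo.

Pattern details

  • (\d+\. OAS / [A-Z]{2}) - capturing group 1:
    • \d+ - 1 or more digits
    • \. - a literal .
    •  OAS /  - the literal substring " OAS / ", including its spaces
    • [A-Z]{2} - two uppercase letters
  • \s* - 0 or more whitespace characters
  • (.*?) - capturing group 2: any 0 or more characters other than line breaks, as few as possible
  • (?=\s*\d+\. OAS / [A-Z]{2}|\z) - positive lookahead: immediately to the right of the current position there must be either
    • \s*\d+\. OAS / [A-Z]{2} - 0 or more whitespace characters, 1 or more digits, ., a space, OAS, a space, /, a space, two uppercase letters
    • | - or
    • \z - the very end of the string.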
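
The same lazy-match-plus-lookahead idea can also be tried in base R, without stringr. This is only a sketch under the assumption that PCRE is acceptable (`perl = TRUE`); the sample string is shortened, and `pattern`, `chunks` and `split_header` are illustrative names.

```r
# Shortened sample in the same shape as the string in the question.
have <- "1. OAS / AC 12345/first text, 2. OAS / AD     79856/second text. 3. OAS / AE 87987/third text."

# Header, then as little text as possible, up to the next header or the end of the string.
pattern <- "\\d+\\. OAS / [A-Z]{2}\\s*.*?(?=\\s*\\d+\\. OAS / [A-Z]{2}|\\z)"

# Each element of chunks is one "header + body" piece.
chunks <- regmatches(have, gregexpr(pattern, have, perl = TRUE))[[1]]

# Separate the header (group 1) from the body (group 2) of every chunk.
split_header <- "^(\\d+\\. OAS / [A-Z]{2})\\s*(.*)$"
res <- sub(split_header, "\\2", chunks, perl = TRUE)
names(res) <- sub(split_header, "\\1", chunks, perl = TRUE)
```

The lookahead does not consume the next header, which is why the following match can start right at it.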

Answer 2: (score: 0)

The way you describe the problem is not entirely clear, but if you want to extract everything up to "OAS / AC":

library(qdap)
beg2char(have, " ", 4)#looks for the fourth occurrence of \\s and extracts everything before it.

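For the function above to work, the sentences should be individual strings in a character vector.

If what you actually want is to insert an "=" sign between the two-letter substring after "OAS" and the digits that follow it: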

gsub("([A-Z])\\s*([0-9])", "\\1 = \\2", have, perl = TRUE)
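
As a small illustration of what this substitution does, here is a sketch on a shortened sample string (the names `sample_text`, `out` and `out_strict` are illustrative). Because the pattern matches any uppercase letter followed by optional whitespace and a digit, a stricter variant anchored on the "OAS / XX" header is shown alongside it; for this input both give the same result.

```r
# Shortened sample in the same shape as the string in the question.
sample_text <- "1. OAS / AC 12345/test, 2. OAS / AD     79856/test"

# Original idea: any uppercase letter, optional whitespace, then a digit.
out <- gsub("([A-Z])\\s*([0-9])", "\\1 = \\2", sample_text, perl = TRUE)

# Stricter variant (an assumption about the intent): only after "OAS / XX".
out_strict <- gsub("(OAS / [A-Z]{2})\\s*([0-9])", "\\1 = \\2", sample_text, perl = TRUE)

print(out)
# [1] "1. OAS / AC = 12345/test, 2. OAS / AD = 79856/test"
```

The stricter form may be the safer choice if the free text after the number can itself contain a capital letter directly followed by a digit.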