Extracting text strings in R

Time: 2018-09-04 04:35:33

Tags: r regex

I have a column with contents like this:

> PREFI.(S): NETWORK SA|ADV.(A/S):JOHN SMITH SANT'ANNA (30652/RS) AND OTHER(A/S)|RECDO.(A/S): CLAUDIA TRROMMER|ADV.(A/S): LOUISE (52417/RS)

> PREFI.(S): RUTH SEIXAS|ADV.(A/S): LOPES SOUTO (47706/RS)|RECDO.(A/S): MARTINS (64285/RS)

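I want to: 1) split the values on | 2) keep only the text between a ")" or ":" and the next non-letter character or the end of the line

The result would be: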

 NETWORK SA 
 JOHN SMITH
 AND OTHER
 CLAUDIA TRROMMER
 LOUISE RUTH

I think I have managed the first part:

docs <- str_split(processos$partes,"\\|")

But I can't figure out the last part, even after trying regex lookbehind/lookahead.

1 answer:

Answer 0: (score: 1)

A solution using the tidyverse stringr functions:

> library(tidyverse)

> x <- "
+ > PREFI.(S): NETWORK SA|ADV.(A/S):JOHN SMITH SANT'ANNA (30652/RS) AND OTHER(A/S)|RECDO.(A/S): CLAUDIA TRROMMER|ADV.(A/S): LOUISE (52417/RS) ..." ... [TRUNCATED] 

> # split on "|"
> xs <- str_split(x, "\\|")[[1]]

> # extract the data
> str_extract_all(xs, "\\):[ a-zA-Z]*") %>%
+   unlist() %>%
+   sub("^..", "", .)  # get rid of "):"
[1] " NETWORK SA"       "JOHN SMITH SANT"   " CLAUDIA TRROMMER"
[4] " LOUISE "          " RUTH SEIXAS"      " LOPES SOUTO "    
[7] " MARTINS "
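
Note that the pattern above only matches text after "):", so " AND OTHER" (which follows a bare ")") is not returned. Since the question mentions lookbehind, here is a minimal sketch of that variant in base R, assuming the goal is any run of letters and spaces that directly follows either ")" or ":". The variable names (`parts`, `m`, `names_found`) are illustrative only, and the sample is the first row from the question. The name still stops at the apostrophe in SANT'ANNA because "'" is not in the character class.

```r
# Sample value copied from the question (first row only).
x <- "PREFI.(S): NETWORK SA|ADV.(A/S):JOHN SMITH SANT'ANNA (30652/RS) AND OTHER(A/S)|RECDO.(A/S): CLAUDIA TRROMMER|ADV.(A/S): LOUISE (52417/RS)"

# Split on the literal "|" (fixed = TRUE avoids having to escape it).
parts <- strsplit(x, "|", fixed = TRUE)[[1]]

# Lookbehind: match a run of letters/spaces that immediately follows
# ")" or ":", without including that delimiter in the match.
m <- regmatches(parts, gregexpr("(?<=[):])[ A-Za-z]+", parts, perl = TRUE))

# Flatten the list, trim surrounding whitespace, and drop empty matches.
names_found <- trimws(unlist(m))
names_found <- names_found[nzchar(names_found)]

print(names_found)
```

Because the delimiter sits inside the lookbehind, no `sub()` step is needed to strip it afterwards. If names containing apostrophes should be kept whole, adding `'` to the character class is one option, at the cost of departing from the "letters only" rule stated in the question.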