I have a column with content like this:
> PREFI.(S): NETWORK SA|ADV.(A/S):JOHN SMITH SANT'ANNA (30652/RS) AND OTHER(A/S)|RECDO.(A/S): CLAUDIA TRROMMER|ADV.(A/S): LOUISE (52417/RS)
> PREFI.(S): RUTH SEIXAS|ADV.(A/S): LOPES SOUTO (47706/RS)|RECDO.(A/S): MARTINS (64285/RS)
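I want to: 1) split the values on | 2) keep only the text between a ")" or ":" and the next non-alphabetic character or the end of the line
The result would be: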
NETWORK SA
JOHN SMITH
AND OTHER
CLAUDIA TRROMMER
LOUISE RUTH
etc.
I think I have managed the first part:
docs <- str_split(processos$partes,"\\|")
But I can't work out the last part, even after trying regular expressions with lookbehind/lookahead.
Answer 0 (score: 1)
Solution:
> library(tidyverse)
> x <- "
+ > PREFI.(S): NETWORK SA|ADV.(A/S):JOHN SMITH SANT'ANNA (30652/RS) AND OTHER(A/S)|RECDO.(A/S): CLAUDIA TRROMMER|ADV.(A/S): LOUISE (52417/RS) ..." ... [TRUNCATED]
> # split on "|"
> xs <- str_split(x, "\\|")[[1]]
> # extract the data
> str_extract_all(xs, "\\):[ a-zA-Z]*") %>%
+ unlist() %>%
+ sub("^..", "", .) # get rid of "):"
[1] " NETWORK SA" "JOHN SMITH SANT" " CLAUDIA TRROMMER"
[4] " LOUISE " " RUTH SEIXAS" " LOPES SOUTO "
[7] " MARTINS "
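The output above still carries the leading and trailing spaces. As a possible variation (a sketch, not part of the original answer), a lookbehind can drop the "):" during matching and str_trim() can remove the padding. The two sample rows are retyped here as a character vector x, which is an assumption about how the column is stored; names_raw and names_clean are illustrative names only.

```r
library(stringr)

# The two sample rows from the question, retyped as a character vector
# (assumption: the real data lives in a column such as processos$partes).
x <- c(
  "PREFI.(S): NETWORK SA|ADV.(A/S):JOHN SMITH SANT'ANNA (30652/RS) AND OTHER(A/S)|RECDO.(A/S): CLAUDIA TRROMMER|ADV.(A/S): LOUISE (52417/RS)",
  "PREFI.(S): RUTH SEIXAS|ADV.(A/S): LOPES SOUTO (47706/RS)|RECDO.(A/S): MARTINS (64285/RS)"
)

# Split every row on the literal "|" and flatten to one vector of parts.
parts <- unlist(str_split(x, fixed("|")))

# Take the first run of letters and spaces that follows "):" in each part.
# The lookbehind keeps "):" out of the match, so no sub() is needed.
names_raw <- str_extract(parts, "(?<=\\):)[ A-Za-z]*")

# Strip the surrounding whitespace.
names_clean <- str_trim(names_raw)
print(names_clean)
```

Like the answer's pattern, this stops at the first character that is not a letter or a space, so "SANT'ANNA" is cut at the apostrophe, and text that follows a ")" without a ":" (such as "AND OTHER") is not captured.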