I need to extract the following patterns in R:
(10 digits), prefix with 3, 5, 9 (e.g. 3234567890, 5234567890, 9234567890)
(10 digits), prefix with 4 (e.g. 4234567890)
(10 digits), prefix with 8 (e.g. 8234567890)
and
TAM(5 digits) – e.g. TAM12345 (numbers starting with TAM and 5 digits)
E(7 digits) – e.g. E1234567 (numbers starting with E and only 7 digits)
A(5 digits) – e.g. A12345 (numbers starting with A and only 5 digits)
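I am using the stringr library.
I am able to extract the numbers (with alpha), but I don't know how to specify particular prefixes and restrict the number of digits.
The email text is below:
These are the notice number - with high priority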
3234567890 and 5234567890 and the long pending issue 9234567890 along with the discuused numbers 4234567890,8234567890.
Special messages from TAM12345,E1234567 and A12345
Required output:
3234567890, 5234567890, 9234567890
4234567890
8234567890
TAM12345
E1234567
A12345
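Answer 0 (score: 2)
You can try the code below, which uses the word boundary \b. A word boundary matches the position between a word character and a non-word character, so each pattern only matches when it stands alone as a whole token.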
> library(stringr)
> # perl() is deprecated since stringr 1.0.0; a plain string is already treated as a regex
> str_extract_all(x, '\\b(?:[35948]\\d{9}|TAM\\d{5}|E\\d{7}|A\\d{5})\\b')
[[1]]
[1] "3234567890" "5234567890" "9234567890" "4234567890" "8234567890"
[6] "TAM12345" "E1234567" "A12345"
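To see what the word boundaries contribute, here is a minimal sketch in base R. The input string `y` is made up for illustration and is not from the question: it contains an 11-digit number followed by a valid 10-digit one. Without `\b`, the pattern would also pick a 10-digit run out of the longer number.

```r
# Hypothetical input: the first number has 11 digits, the second exactly 10.
y <- "13234567890 and 3234567890"

pattern_bounded   <- "\\b[35948]\\d{9}\\b"  # must stand alone as a token
pattern_unbounded <- "[35948]\\d{9}"        # may match inside a longer number

bounded   <- regmatches(y, gregexpr(pattern_bounded, y, perl = TRUE))[[1]]
unbounded <- regmatches(y, gregexpr(pattern_unbounded, y, perl = TRUE))[[1]]

print(bounded)    # only the standalone 10-digit number
print(unbounded)  # also a 10-digit slice of the 11-digit number
```

The bounded pattern is what restricts the match to exactly 10 digits; the `{9}` quantifier alone only guarantees at least that many.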
Answer 1 (score: 2)
Using the stringr library:
> library(stringr)
> # perl() is deprecated since stringr 1.0.0; a plain string is already treated as a regex
> str_extract_all(x, '\\b(?:[3-589]\\d{9}|(?:TAM|A)\\d{5}|E\\d{7})\\b')
[[1]]
[1] "3234567890" "5234567890" "9234567890" "4234567890" "8234567890"
[6] "TAM12345" "E1234567" "A12345"
Using the gsubfn library:
> library(gsubfn)
> strapply(x, '\\b([3-589]\\d{9}|(?:TAM|A)\\d{5}|E\\d{7})\\b', perl=TRUE)
[[1]]
[1] "3234567890" "5234567890" "9234567890" "4234567890" "8234567890"
[6] "TAM12345" "E1234567" "A12345"
Base R can handle this as well.
> regmatches(x, gregexpr('\\b(?:[3-589]\\d{9}|(?:TAM|A)\\d{5}|E\\d{7})\\b', x, perl=TRUE))
[[1]]
[1] "3234567890" "5234567890" "9234567890" "4234567890" "8234567890"
[6] "TAM12345" "E1234567" "A12345"
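The calls above return one flat vector, whereas the required output in the question lists each prefix group on its own line. One possible way to get that grouping, sketched in base R, is to run one pattern per group. The names in `patterns` are made up for illustration, and `x` is assumed to hold the email text from the question.

```r
# Assumed input: the email text from the question, joined into one string.
x <- paste(
  "These are the notice number - with high priority",
  "3234567890 and 5234567890 and the long pending issue 9234567890 along with the discuused numbers 4234567890,8234567890.",
  "Special messages from TAM12345,E1234567 and A12345",
  sep = "\n"
)

# One pattern per line of the required output (group names are hypothetical).
patterns <- c(
  prefix_359 = "\\b[359]\\d{9}\\b",
  prefix_4   = "\\b4\\d{9}\\b",
  prefix_8   = "\\b8\\d{9}\\b",
  tam        = "\\bTAM\\d{5}\\b",
  e          = "\\bE\\d{7}\\b",
  a          = "\\bA\\d{5}\\b"
)

# Extract all matches for each pattern; lapply keeps the group names.
grouped <- lapply(patterns, function(p) {
  regmatches(x, gregexpr(p, x, perl = TRUE))[[1]]
})

# Print one comma-separated line per group.
for (g in grouped) cat(paste(g, collapse = ", "), "\n", sep = "")
```

Keeping the patterns separate costs six passes over the text instead of one, but it avoids having to classify the matches after the fact.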