Question

I need to extract the following patterns in R:

 (10 digits), prefix with 3, 5, 9 (e.g. 3234567890, 5234567890, 9234567890) 
 (10 digits), prefix with 4 (e.g. 4234567890)
 (10 digits), prefix with 8 (e.g. 8234567890)

and

 TAM(5 digits) – e.g. TAM12345 (numbers starting with TAM and 5 digits)
 E(7 digits) – e.g. E1234567 (numbers starting with E and only 7 digits)
 A(5 digits) – e.g. A12345 (numbers starting with A and only 5 digits)

I am using the stringr library.

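I am able to extract the numbers (with alpha characters), but I don't know how to specify particular prefixes and limit the number of digits.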

The email text is below:

These are the notice number - with high priority
3234567890 and 5234567890 and the long pending issue 9234567890 along with the discuused numbers 4234567890,8234567890.
Special messages from TAM12345,E1234567 and A12345

Required output

3234567890, 5234567890, 9234567890
4234567890
8234567890
TAM12345
E1234567
A12345

Answer 1

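You could try the code below, which uses the word boundary \b. A word boundary matches the position between a word character and a non-word character, so a pattern wrapped in \b cannot match part of a longer run of letters or digits.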

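> library(stringr)
> # x is assumed to hold the email text from the question
> # perl() was deprecated in stringr 1.0.0; a plain pattern string is enough
> str_extract_all(x, '\\b(?:[35948]\\d{9}|TAM\\d{5}|E\\d{7}|A\\d{5})\\b')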
[[1]]
[1] "3234567890" "5234567890" "9234567890" "4234567890" "8234567890"
[6] "TAM12345"   "E1234567"   "A12345"
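
To see what the word boundaries contribute, here is a minimal sketch in base R. The input string `y` and the variable names are hypothetical and not part of the question; the example only compares the 10-digit pattern with and without \b.

```r
# Hypothetical input: one valid 10-digit number and one 11-digit number.
y <- "3234567890 and 32345678901"
pat <- "[35948]\\d{9}"

# Without \b, the first 10 digits of the 11-digit number also match.
no_boundary <- regmatches(y, gregexpr(pat, y, perl = TRUE))[[1]]

# With \b on both sides, only the standalone 10-digit number matches.
bounded <- paste0("\\b", pat, "\\b")
with_boundary <- regmatches(y, gregexpr(bounded, y, perl = TRUE))[[1]]

print(no_boundary)
print(with_boundary)
```

Without the boundaries, the 11-digit number contributes a partial match, which is probably not wanted when the digit count is meant to be exact.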

Answer 2

Using the stringr library:

> library(stringr)
> # perl() was deprecated in stringr 1.0.0; a plain pattern string is enough
> str_extract_all(x, '\\b(?:[3-589]\\d{9}|(?:TAM|A)\\d{5}|E\\d{7})\\b')
[[1]]
[1] "3234567890" "5234567890" "9234567890" "4234567890" "8234567890"
[6] "TAM12345"   "E1234567"   "A12345"

Using the gsubfn library:

> library(gsubfn)
> strapply(x, '\\b([3-589]\\d{9}|(?:TAM|A)\\d{5}|E\\d{7})\\b', perl = TRUE)
[[1]]
[1] "3234567890" "5234567890" "9234567890" "4234567890" "8234567890"
[6] "TAM12345"   "E1234567"   "A12345"

Base R can handle this as well.

> regmatches(x, gregexpr('\\b(?:[3-589]\\d{9}|(?:TAM|A)\\d{5}|E\\d{7})\\b', x, perl = TRUE))
[[1]]
[1] "3234567890" "5234567890" "9234567890" "4234567890" "8234567890"
[6] "TAM12345"   "E1234567"   "A12345"
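
The calls above return all matches in one flat vector, whereas the required output in the question lists each prefix group separately. One possible way to get that grouping, sketched here in base R, is to run one pattern per group. The names in `patterns` are made up for illustration, and `x` is rebuilt from the email text in the question.

```r
# Email text from the question, kept verbatim.
x <- paste(
  "These are the notice number - with high priority",
  "3234567890 and 5234567890 and the long pending issue 9234567890 along with the discuused numbers 4234567890,8234567890.",
  "Special messages from TAM12345,E1234567 and A12345",
  sep = "\n"
)

# One pattern per required output group (group names are hypothetical).
patterns <- c(
  prefix_359 = "\\b[359]\\d{9}\\b",
  prefix_4   = "\\b4\\d{9}\\b",
  prefix_8   = "\\b8\\d{9}\\b",
  tam        = "\\bTAM\\d{5}\\b",
  e          = "\\bE\\d{7}\\b",
  a          = "\\bA\\d{5}\\b"
)

# Apply each pattern separately; lapply keeps the group names.
grouped <- lapply(patterns, function(p) {
  regmatches(x, gregexpr(p, x, perl = TRUE))[[1]]
})

print(grouped)
```

Each element of `grouped` can then be collapsed with `paste(..., collapse = ", ")` if a comma-separated line per group is needed.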

Extracting numbers with specific prefixes and a fixed number of digits - Regex in R

2 answers: