Extracting names from emails using regex in R

Asked: 2015-06-10 05:44:14

Tags: regex r

I have a string containing an email chain, and I need to extract the sender names (From :). A sample of the email is below:

str1 <- 'From : Wendy YEOW (SLA) To : xxxx@lt.org Subject : RE: OneService@S
From: SLA Enquiry (SLA) Sent: Friday, 5 June, 2015 5:26 PM To : xxxx@lt.org Subject : RE: OneService@S 
From: Siti Zaharah RAMAN (ARKS) Sent: Friday, 5 June, 2015 5:26 PM To : xxxx@lt.org Subject : RE: OneService@S 
From: SLA Enquiry (SLA) Sent: Friday, 5 June, 2015 5:26 PM To : xxxx@lt.org Subject : RE: OneService@S 
From: Chin Hwang LAU (TA) Sent: Friday, 5 June, 2015 5:26 PM To : xxxx@lt.org Subject : RE: OneService@S'

I have the following code to extract the names:

library(stringr)  # provides str_extract_all()
str_extract_all(string=str1,pattern="\\b(From\\s*[:]+\\s*(\\w*))\\b")[[1]]
[1] "From : Wendy" "From: SLA"    "From: Siti"   "From: SLA"    "From: Chin"

But my desired output is:

[1] "Wendy YEOW (SLA)"    "SLA Enquiry (SLA)"    "Siti Zaharah RAMAN (ARKS)"   "SLA Enquiry (SLA)"    "Chin Hwang LAU (TA)"

3 Answers:

Answer 0: (score: 3)

Try this regular expression together with strsplit():

gsub("From *: (.*?) (To|Sent).*", "\\1", strsplit(str1, "\n")[[1]])

[1] "Wendy YEOW (SLA)"         
[2] "SLA Enquiry (SLA)"        
[3] "Siti Zaharah RAMAN (ARKS)"
[4] "SLA Enquiry (SLA)"        
[5] "Chin Hwang LAU (TA)" 

This works because I use a backreference (\\1) to extract whatever the wildcard inside the first pair of parentheses matched.
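As a minimal sketch of the same idea on a single made-up line (the data and the variable names `line` and `name` are only illustrative, not from the question), the capture group keeps the text between the `From :` prefix and the first ` To` or ` Sent`:

```r
# One hypothetical header line in the same shape as the question's data
line <- "From: Jane DOE (HR) Sent: Friday, 5 June, 2015 5:26 PM To : xxxx@lt.org"

# (.*?) is a lazy capture group; "\\1" in the replacement refers back to it.
# The pattern matches the whole line, so everything except the captured
# name is dropped.
name <- gsub("From *: (.*?) (To|Sent).*", "\\1", line)
print(name)
```

One limitation of this pattern is that a name containing a word that starts with "To" or "Sent" after a space would be cut short at that word.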

Answer 1: (score: 3)

You can use strsplit. There is no need for gsub here.

strsplit(str1, "From ?: | (To|Sent) ?:.*?(\\nFrom ?: |$)")[[1]][-1]
# [1] "Wendy YEOW (SLA)"          "SLA Enquiry (SLA)"         "Siti Zaharah RAMAN (ARKS)"
# [4] "SLA Enquiry (SLA)"         "Chin Hwang LAU (TA)"  

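The regex basically consists of two parts:

  1. "From ?: ": this matches the beginning of the string. Splitting there returns an empty string and the rest of the original string.
  2. " (To|Sent) ?:.*?(\\nFrom ?: |$)": this matches the text that follows each name. It covers the substring that starts with "To" or "Sent" and ends with a line break ("\\n") followed by the next "From", or with the end of the string ("$").

Finally, [-1] is needed to remove the empty string (the one before the first "From").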
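To see where the empty string comes from, here is a small sketch on two made-up header lines (the data and the variable names `s` and `pieces` are assumptions for illustration only):

```r
# Two hypothetical header lines joined by a newline, shaped like the question's data
s <- "From : Ann LEE (X) To : a@b.org Subject : hi\nFrom: Bob TAN (Y) Sent: Friday To : a@b.org"

# Split on either the leading "From :" or on everything from " To"/" Sent"
# up to and including the next "From :" (or up to the end of the string)
pieces <- strsplit(s, "From ?: | (To|Sent) ?:.*?(\\nFrom ?: |$)")[[1]]

print(pieces)      # the first element is the empty string before the first "From"
print(pieces[-1])  # dropping it leaves only the names
```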

Answer 2: (score: 1)

Not very elegant, but you can try:

gsub(" *(From|To|Sent) *:? *","",regmatches(str1,gregexpr("From *:[^:]+",str1))[[1]])
#[1] "Wendy YEOW (SLA)"          "SLA Enquiry (SLA)"        
#[3] "Siti Zaharah RAMAN (ARKS)" "SLA Enquiry (SLA)"        
#[5] "Chin Hwang LAU (TA)"
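
Broken into its two steps on a single made-up line (hypothetical data, illustrative variable names `line`, `raw` and `name`), the approach looks roughly like this: gregexpr grabs everything from "From :" up to the next colon, and gsub then strips the keywords around the name.

```r
# One hypothetical header line in the same shape as the question's data
line <- "From: Jane DOE (HR) Sent: Friday, 5 June, 2015 5:26 PM"

# Step 1: match "From", optional spaces, a colon, then everything up to
# (but not including) the next colon
raw <- regmatches(line, gregexpr("From *:[^:]+", line))[[1]]
print(raw)

# Step 2: delete the "From", "To" and "Sent" keywords together with the
# spaces and colon around them
name <- gsub(" *(From|To|Sent) *:? *", "", raw)
print(name)
```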