Extracting strings from text using R

Time: 2014-12-24 10:01:24

Tags: r grep

The input is a text file with the following contents:

From: abc@xyz.com

To: qwe@xyz.com, ewq@xyz.com

tuu@xyz.com, vbn@xyz.com

lkj@xyz.com, jkl@xyz.com

Subject: Introduction to R

B-CC: qwe@xyz.com, ewq@xyz.com

tuu@xyz.com, vbn@xyz.com

lkj@xyz.com, jkl@xyz.com

Required output:

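I need to collect all the email IDs under To into one object, and all those under B-CC into another. The challenge is that the email IDs of a field are not all on one line; they are spread over several lines. All the email IDs of a field need to be copied into a single object: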

To: qwe@xyz.com, ewq@xyz.com, tuu@xyz.com, vbn@xyz.com, lkj@xyz.com, jkl@xyz.com

B-CC: qwe@xyz.com, ewq@xyz.com, tuu@xyz.com, vbn@xyz.com, lkj@xyz.com, jkl@xyz.com

4 answers:

Answer 0: (score: 2)

You can do this:

library(stringr)
# Join the trimmed lines into a single comma-separated string
str1 <- paste(str_trim(lines), collapse=', ')
# stringr >= 1.0 uses the ICU regex engine, so lookaheads work without
# the perl() wrapper (deprecated since stringr 1.0.0)
str_extract_all(str1, '(?=To: ).*(?=, Subject)')[[1]]
#[1] "To: qwe@xyz.com, ewq@xyz.com, tuu@xyz.com, vbn@xyz.com, lkj@xyz.com, jkl@xyz.com"
str_extract_all(str1, '(?=B-CC:).*')[[1]]
#[1] "B-CC: qwe@xyz.com, ewq@xyz.com, tuu@xyz.com, vbn@xyz.com, lkj@xyz.com, jkl@xyz.com"

Or using stringi:

 library(stringi)
 stri_extract_all_regex(str1, '(?=To: ).*(?=, Subject)')[[1]]
 #[1] "To: qwe@xyz.com, ewq@xyz.com, tuu@xyz.com, vbn@xyz.com, lkj@xyz.com, jkl@xyz.com"

 stri_extract_all_regex(str1, '(?=B-CC:).*')[[1]]
 #[1] "B-CC: qwe@xyz.com, ewq@xyz.com, tuu@xyz.com, vbn@xyz.com, lkj@xyz.com, jkl@xyz.com"

Data

 lines <- readLines(n=8)
 From: abc@xyz.com
 To: qwe@xyz.com, ewq@xyz.com
 tuu@xyz.com, vbn@xyz.com
 lkj@xyz.com, jkl@xyz.com
 Subject: Introduction to R
 B-CC: qwe@xyz.com, ewq@xyz.com
 tuu@xyz.com, vbn@xyz.com
 lkj@xyz.com, jkl@xyz.com
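
For readers without stringr or stringi installed, the same collapse-then-extract idea can be sketched in base R. This is only an illustration, not the answer's own code: it swaps in `regmatches()`/`regexpr()` with `perl = TRUE` for the stringr calls, uses `trimws()` (R >= 3.2.0) in place of `str_trim()`, and inlines the sample data as a character vector, because `readLines(n=8)` only works when the lines are typed or pasted into an interactive session.

```r
# Sample header lines from the question, inlined so the sketch runs
# non-interactively.
lines <- c(
  "From: abc@xyz.com",
  "To: qwe@xyz.com, ewq@xyz.com",
  "tuu@xyz.com, vbn@xyz.com",
  "lkj@xyz.com, jkl@xyz.com",
  "Subject: Introduction to R",
  "B-CC: qwe@xyz.com, ewq@xyz.com",
  "tuu@xyz.com, vbn@xyz.com",
  "lkj@xyz.com, jkl@xyz.com"
)

# Collapse all lines into one comma-separated string.
str1 <- paste(trimws(lines), collapse = ", ")

# Extract the first match of each pattern with base R's PCRE engine.
# The lookahead keeps ", Subject" out of the To match.
to_field  <- regmatches(str1, regexpr("To: .*(?=, Subject)", str1, perl = TRUE))
bcc_field <- regmatches(str1, regexpr("B-CC: .*", str1, perl = TRUE))

print(to_field)
print(bcc_field)
```

Like the patterns above, this assumes that Subject is the header directly after To and that B-CC is the last header in the file.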

Answer 1: (score: 2)

The same as @akrun's answer, with a slight modification.

> library(stringr)
> lines <- readLines(n=8)
From: abc@xyz.com
To: qwe@xyz.com, ewq@xyz.com
tuu@xyz.com, vbn@xyz.com
lkj@xyz.com, jkl@xyz.com
Subject: Introduction to R
B-CC: qwe@xyz.com, ewq@xyz.com
tuu@xyz.com, vbn@xyz.com
lkj@xyz.com, jkl@xyz.com
> str1 <- paste(str_trim(lines), collapse=', ')
> # No perl() wrapper needed: stringr >= 1.0 supports lookaheads directly
> str_extract_all(str1, '(?=To:\\s+).*?(?=,\\s+\\w+:|$)')[[1]]
[1] "To: qwe@xyz.com, ewq@xyz.com, tuu@xyz.com, vbn@xyz.com, lkj@xyz.com, jkl@xyz.com"
> str_extract_all(str1, '(?=B-CC:\\s+).*?(?=,\\s+\\w+:|$)')[[1]]
[1] "B-CC: qwe@xyz.com, ewq@xyz.com, tuu@xyz.com, vbn@xyz.com, lkj@xyz.com, jkl@xyz.com"
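
The lazy `.*?` followed by a lookahead for "comma, whitespace, header name, colon" (or end of string) is what frees these patterns from depending on a particular next header. A minimal base R sketch of that idea as a reusable helper follows. The function name `extract_field` is hypothetical, and the header class `[\\w-]+` is an assumption that widens the original `\\w+` so that a hyphenated header such as `B-CC:` also counts as a boundary.

```r
# The sample data collapsed into one comma-separated string.
str1 <- paste(c(
  "From: abc@xyz.com",
  "To: qwe@xyz.com, ewq@xyz.com",
  "tuu@xyz.com, vbn@xyz.com",
  "lkj@xyz.com, jkl@xyz.com",
  "Subject: Introduction to R",
  "B-CC: qwe@xyz.com, ewq@xyz.com",
  "tuu@xyz.com, vbn@xyz.com",
  "lkj@xyz.com, jkl@xyz.com"
), collapse = ", ")

# Hypothetical helper: return "<name>: <value>" for one header field.
# The lazy .*? stops at the first ", <Header>:" that follows, or at the
# end of the string when no further header exists.
extract_field <- function(x, name) {
  pattern <- paste0(name, ":\\s+.*?(?=,\\s+[\\w-]+:|$)")
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

to_field      <- extract_field(str1, "To")
subject_field <- extract_field(str1, "Subject")
bcc_field     <- extract_field(str1, "B-CC")

print(to_field)
print(subject_field)
print(bcc_field)
```

The helper assumes `name` contains no regex metacharacters and does not occur inside another header name (for example `To` inside `Reply-To`).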

Answer 2: (score: 1)

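Read in the lines and prefix each line that does not contain a colon with a space, which marks it as a continuation line. The result is then in DCF format, so we can read it with read.dcf and replace any newlines with a comma and a space. The resulting structure will contain From, To, Subject and B-CC components.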

Lines <- readLines("myfile.txt")

hasColon <- grepl(":", Lines)
Lines[!hasColon] <- paste("", Lines[!hasColon])

email <- read.dcf(textConnection(Lines))[1, ]
email <- gsub("\n", ", ", email)

which gives:

> email[['To']]
[1] "qwe@xyz.com, ewq@xyz.com, tuu@xyz.com, vbn@xyz.com, lkj@xyz.com, jkl@xyz.com"
> email[['B-CC']]
[1] "qwe@xyz.com, ewq@xyz.com, tuu@xyz.com, vbn@xyz.com, lkj@xyz.com, jkl@xyz.com"
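
As a self-contained variant of the read.dcf approach, the sketch below inlines the sample data instead of reading `myfile.txt`, closes the text connection explicitly, and adds one step that is not part of the answer: splitting the To field into individual addresses with `strsplit()`. The variable names `con` and `to_ids` are illustrative only.

```r
# Sample data inlined (the answer reads it from "myfile.txt").
Lines <- c(
  "From: abc@xyz.com",
  "To: qwe@xyz.com, ewq@xyz.com",
  "tuu@xyz.com, vbn@xyz.com",
  "lkj@xyz.com, jkl@xyz.com",
  "Subject: Introduction to R",
  "B-CC: qwe@xyz.com, ewq@xyz.com",
  "tuu@xyz.com, vbn@xyz.com",
  "lkj@xyz.com, jkl@xyz.com"
)

# Lines without a colon become DCF continuation lines (leading space).
hasColon <- grepl(":", Lines)
Lines[!hasColon] <- paste("", Lines[!hasColon])

# Parse the single DCF record and take its first row as a named vector.
con <- textConnection(Lines)
email <- read.dcf(con)[1, ]
close(con)

# Extra step (not in the answer): split the To field into single
# addresses; ",\\s*" also absorbs the newlines between continuation lines.
to_ids <- strsplit(email[["To"]], ",\\s*")[[1]]
print(to_ids)
```

Splitting on `",\\s*"` works whether or not the newlines have already been replaced, so the `gsub()` step is optional when only the individual addresses are needed.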

Answer 3: (score: 0)

cat input | sed 's/: /\n/' | awk '/To/{flag=1;next}/Subject/{flag=0}flag' > to.txt
cat input | sed 's/: /\n/' | awk '/B-CC/{flag=1;next}/FINISH/{flag=0}flag' > bcc.txt

If I understand your question correctly, this should work for you.
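
Since the question asks for R, here is a rough R analogue of the flag idea in the awk commands: treat every line that starts with a header name and a colon as the start of a new group, and attach the lines that follow to that group. This is a sketch under stated assumptions, not a translation of the commands above: it assumes header names consist of letters and hyphens only, that the first line is a header, and the names `is_header`, `blocks` and `fields` are illustrative.

```r
lines <- c(
  "From: abc@xyz.com",
  "To: qwe@xyz.com, ewq@xyz.com",
  "tuu@xyz.com, vbn@xyz.com",
  "lkj@xyz.com, jkl@xyz.com",
  "Subject: Introduction to R",
  "B-CC: qwe@xyz.com, ewq@xyz.com",
  "tuu@xyz.com, vbn@xyz.com",
  "lkj@xyz.com, jkl@xyz.com"
)

# A header line starts with a name made of letters/hyphens and a colon.
is_header <- grepl("^[A-Za-z-]+:", lines)

# cumsum() increments at each header, so continuation lines inherit the
# group number of the header above them.
blocks <- split(trimws(lines), cumsum(is_header))

# Join each group into one string, then separate name and value.
fields <- vapply(blocks, paste, character(1), collapse = ", ")
names(fields) <- sub(":.*$", "", fields)
fields <- sub("^[^:]+:\\s*", "", fields)

print(fields[["To"]])
print(fields[["B-CC"]])
```

This keeps every header in one named character vector, so no end marker is needed for the last field.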