I have a very specific problem. I have a set of PDF files containing emails (and email chains), which usually have the following format:
From: Doe, John <john.doe@mail.com>
To: Doe, Jane <john.doe@mail.com>; Doe, John
Subject: Re: Title
text ...
...
From: Doe, John <john.doe@mail.com>
To: Doe, Jane <john.doe@mail.com>; Doe, John
CC: Moe, James; Klein, John
Subject: Title
text ...
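So within a single PDF file you will usually find several "From", "To", and "CC" blocks. Names are always formatted as last name and first name separated by a comma. Different names are separated by semicolons. However, sometimes the full email address (which I don't need) is included between "<" and ">". I would like to extract all names (in the From, To, and CC sections) from these PDF files, so that the final output looks like this: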
Last name first name
Doe John
Doe Jane
Moe James
Klein John
I have managed to read the PDF files using the pdftools package:
library(pdftools)
files <- list.files(pattern = "pdf$")
pdfs <- lapply(files, pdf_text)
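However, I am currently somewhat stuck trying to find the best way to extract all the names and store them in a data frame. I have been looking at the str_extract function, for example starting with str_extract(pdfs[[1]], regex("From.*To", ignore_case = TRUE)), but have not managed to find a working solution. Any help would be greatly appreciated. For example, suppose pdfs[[1]] contains the following string: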
teststring <- "From: Doe, John <john.doe@mail.com>\r\n
To: Doe, Jane <john.doe@mail.com>; Doe, John\r\n
Subject: Re: Title\r\n
text ...\r\n
...\r\n
From: Doe, John <john.doe@mail.com>\r\n
To: Doe, Jane <john.doe@mail.com>; Doe, John\r\n
CC: Moe, James; Klein, John\r\n
Subject: Title\r\n
text ...\r\n"
Answer 0 (score: 1)
Using stringr on teststring, extract every "Last, First" pair, drop the duplicates, and split the pairs into a data frame:
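library(stringr)
# Match every "Last, First" pair and drop duplicates
fullnames <- unique(c(str_extract_all(teststring, "[a-zA-Z]+,\\s[a-zA-Z]+", simplify = TRUE)))
# Split on the comma plus any following whitespace so first names carry no leading space
splitnames <- unlist(strsplit(fullnames, ",\\s*"))
# Odd elements are last names, even elements are first names
ans <- data.frame(Last = splitnames[c(TRUE, FALSE)], First = splitnames[c(FALSE, TRUE)])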
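
One caveat with the pattern above: it matches any "word, word" pair anywhere in the text, including the message body, and it only accepts plain ASCII letters. If that turns out to be a problem, a possible alternative is to look only at the From/To/CC header lines, strip the addresses in angle brackets, and split on semicolons. The following is a minimal base R sketch under those assumptions; the function name extract_header_names and the variable sample_text are made up for illustration and are not part of pdftools or stringr. It also assumes each header fits on one line, which may not hold if the PDF extraction wraps long recipient lists.

```r
# Hypothetical helper (not from any package): pull "Last, First" names
# out of From/To/CC header lines only.
extract_header_names <- function(text) {
  # Break the text into lines; tolerate \r\n endings and stray whitespace
  lines <- trimws(unlist(strsplit(text, "\r?\n")))
  # Keep only header lines, then drop the "From:" / "To:" / "CC:" label
  headers <- grep("^(From|To|CC):", lines, ignore.case = TRUE, value = TRUE)
  headers <- sub("^(From|To|CC):\\s*", "", headers, ignore.case = TRUE)
  # Remove e-mail addresses enclosed in angle brackets
  headers <- gsub("\\s*<[^>]*>", "", headers)
  # Names are separated by semicolons; deduplicate them
  full <- unique(trimws(unlist(strsplit(headers, ";"))))
  # Keep only entries that actually contain a comma
  full <- full[grepl(",", full, fixed = TRUE)]
  parts <- strsplit(full, ",\\s*")
  data.frame(
    Last = vapply(parts, `[`, character(1), 1),
    First = vapply(parts, `[`, character(1), 2),
    stringsAsFactors = FALSE
  )
}

# Small sample in the same shape as the question's teststring
sample_text <- paste0(
  "From: Doe, John <john.doe@mail.com>\r\n",
  "To: Doe, Jane <john.doe@mail.com>; Doe, John\r\n",
  "Subject: Re: Title\r\n",
  "text ...\r\n",
  "From: Doe, John <john.doe@mail.com>\r\n",
  "To: Doe, Jane <john.doe@mail.com>; Doe, John\r\n",
  "CC: Moe, James; Klein, John\r\n",
  "Subject: Title\r\n"
)
res <- extract_header_names(sample_text)
print(res)
```

To cover every file, something like unique(do.call(rbind, lapply(pdfs, extract_header_names))) should work with the pdfs list from the question, since strsplit accepts the per-page character vector that pdf_text returns.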