I have email data in .doc format, with one email per .doc file.
Example:
From,
Mr.Joseph,
Sales Head,
Wall Mart,
London
To,
Ms Rebecca,
Junior sales person,
Wall Mart,
London
Dear Ms Rebecca,
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the
1500s, when an unknown printer took a galley of type and scrambled it to
make a type specimen book. It has survived not only five centuries, but also
the leap into electronic typesetting, remaining essentially unchanged. It
was popularised in the 1960s with the release of Letraset sheets containing
Lorem Ipsum passages, and more recently with desktop publishing software
like Aldus PageMaker including versions of Lorem Ipsum.
Yours faithfully
Joseph
I was able to extract the body of this email.
Code:
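library(readtext)  # readtext() reads the .doc file
library(stringr)   # str_replace_all()
docfile <- readtext("letter.doc")$text
docfile <- strsplit(docfile, "\n")[[1]]
# Find the salutation and closing lines; fixed = TRUE matches them literally
startbody <- grep("Dear Ms Rebecca,", docfile, fixed = TRUE)[1]
# The sample letter has no comma after "Yours faithfully", so do not search for one
endbody <- grep("Yours faithfully", docfile, fixed = TRUE)[1]
mainbody <- paste(docfile[startbody:(endbody - 1)], collapse = " ")
finalbody <- str_replace_all(mainbody, "\r ", "")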
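The problem is that there are many emails like this, and the salutation varies: some say "Respected Ma'am", others "Dear Ms Rebecca".
In other words, the code I wrote only works for this particular .doc file, and every .doc file is different from the others.
I want to extract the "From", "To" and "Body" of every document file I have, but this code only works for that one file.
I hope I have made the problem clear. Any input or help would be appreciated.
Thanks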
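To make the goal concrete, here is a minimal base R sketch of a more general splitter. It assumes every letter has a "From," line, a "To," line, a salutation beginning with "Dear" or "Respected", and a closing beginning with "Yours". The function name `parse_letter` and both patterns are hypothetical and would need extending for real data.

```r
# Split one letter (a character vector of lines) into From, To and Body.
# Assumption: the markers appear in the order From, To, salutation, closing.
parse_letter <- function(lines,
                         salutation = "^(dear|respected)\\b",
                         closing = "^yours\\s+(faithfully|sincerely|truly)") {
  # Drop carriage returns, surrounding whitespace and empty lines
  lines <- trimws(gsub("\r", "", lines, fixed = TRUE))
  lines <- lines[nzchar(lines)]

  find <- function(pattern) {
    grep(pattern, lines, ignore.case = TRUE, perl = TRUE)[1]
  }
  idx <- c(find("^from,?$"), find("^to,?$"), find(salutation), find(closing))
  if (anyNA(idx) || is.unsorted(idx, strictly = TRUE)) {
    stop("could not locate From / To / salutation / closing in order")
  }

  # Lines strictly between two marker positions (may be empty)
  between <- function(a, b) {
    if (b - a > 1) lines[(a + 1):(b - 1)] else character(0)
  }

  list(
    from = between(idx[1], idx[2]),
    to   = between(idx[2], idx[3]),
    # Unlike the code above, the salutation line itself is left out of the body
    body = paste(between(idx[3], idx[4]), collapse = " ")
  )
}

example <- c("From,", "Mr.Joseph,", "Sales Head,", "Wall Mart,", "London",
             "To,", "Ms Rebecca,", "Junior sales person,", "Wall Mart,",
             "London", "Dear Ms Rebecca,",
             "Lorem Ipsum is simply dummy text.",
             "Yours faithfully", "Joseph")
res <- parse_letter(example)
```

For many files, the same function could be applied to the lines of each document in turn, for instance by looping over `list.files(pattern = "\\.doc$")` and reading each file as in the code above.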
Answer 0 (score: 0)
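One solution is to use the crfsuite R package, which lets you build conditional random field (CRF) models. Models of this kind can classify blocks of text into whatever categories you define. One category could be, for example, "start of the email" and another "end of the email". Examples are provided in the crfsuite vignette: https://cran.r-project.org/web/packages/crfsuite/index.html. I am using this R package for the same problem you are asking about.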
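As a rough sketch of what that could look like, the code below labels each line of a letter and trains a CRF on those labels with `crf()` and `predict()` from crfsuite. The labels (FROM, TO, BODY, CLOSE), the helper `make_features` and the tiny training set are all made up for illustration. A real model would need many hand-labelled letters and richer features.

```r
library(crfsuite)

# Hypothetical per-line features; crfsuite expects character attributes.
make_features <- function(lines) {
  data.frame(
    first_word = tolower(sub("^\\W*(\\w+).*$", "\\1", lines)),
    ends_comma = ifelse(grepl(",\\s*$", lines), "comma", "nocomma"),
    length     = ifelse(nchar(lines) > 40, "long", "short"),
    stringsAsFactors = FALSE
  )
}

# Illustrative hand-labelled data: one row per line, two short letters.
train <- data.frame(
  doc_id = rep(c("a", "b"), each = 8),
  line = c(
    "From,", "Mr.Joseph,", "To,", "Ms Rebecca,", "Dear Ms Rebecca,",
    "Lorem Ipsum is simply dummy text of the printing industry.",
    "Yours faithfully", "Joseph",
    "From,", "Ms Anna,", "To,", "Mr Paul,", "Respected Sir,",
    "Please find the quarterly sales figures attached to this letter.",
    "Yours sincerely", "Anna"
  ),
  label = rep(c("FROM", "FROM", "TO", "TO", "BODY", "BODY",
                "CLOSE", "CLOSE"), times = 2),
  stringsAsFactors = FALSE
)

# Train a linear-chain CRF; each letter (doc_id) is one sequence.
model <- crf(
  x = make_features(train$line),
  y = train$label,
  group = train$doc_id,
  method = "lbfgs",
  file = tempfile(fileext = ".crfsuite")
)

# Predict a label for every line of an unseen letter.
new_lines <- c("From,", "Mr Lee,", "To,", "Ms Kim,", "Dear Ms Kim,",
               "Thank you for your order, which was shipped yesterday.",
               "Yours truly", "Lee")
pred <- predict(model,
                newdata = make_features(new_lines),
                group = rep("new", length(new_lines)))
```

The lines predicted as BODY can then be pasted together to get the body of the letter, and likewise for the FROM and TO lines.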