I have a VCorpus in R and want to remove every email ID that appears in it. The email IDs can occur anywhere in the corpus. For example:
1> "Could you mail me the Company policy amendments at xyz@gmail.com. Thank you."
2> "Please send me an invoice copy at abcdef@yahoo.co.in. Looking forward to your reply".
Here I want to remove only the email IDs "xyz@gmail.com" and "abcdef@yahoo.co.in" from the corpus.
I have tried:
corpus <- tm_map(corpus,removeWords,"\w*gmail.com\b")
corpus <- tm_map(corpus,removeWords,"\w*yahoo.co.in\b")
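Answer 0 (score: 3)
The code below uses a regular-expression pattern to remove email IDs from the corpus. I picked the regex up somewhere and can no longer recall the source; I would have liked to credit it.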
# Sample data from which email ids need to be removed
text <- c("Could you mail me the Company policy amendments at xyz@gmail.com. Thank you.",
"Please send me an invoice copy at abcdef@yahoo.co.in. Looking forward to your reply." )
# Function containing the regex pattern that removes email ids
RemoveEmail <- function(x) {
  # Namespace-qualified call, so stringr does not need to be attached inside the function
  stringr::str_replace_all(x, "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+", "")
}
library(tm)
corpus = Corpus(VectorSource(text)) # Corpus creation
corpus <- tm_map(corpus,content_transformer(RemoveEmail)) # removing email ids
#Printing the corpus
corpus[[1]]$content
corpus[[2]]$content
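Note that the last character class in the pattern above includes ".", so a full stop that directly follows an address (as in both sample sentences) is removed along with it. A minimal base-R sketch of a variant that leaves that full stop in place is shown below; the names `email_pattern` and `remove_email` are illustrative and the pattern is an assumption, not a complete email grammar.

```r
# Sample sentences from the question
text <- c("Could you mail me the Company policy amendments at xyz@gmail.com. Thank you.",
          "Please send me an invoice copy at abcdef@yahoo.co.in. Looking forward to your reply.")

# Assumed variant: each domain label is matched separately and the last label
# must be alphanumeric, so a trailing full stop is not part of the match
email_pattern <- "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(\\.[a-zA-Z0-9-]+)*\\.[a-zA-Z0-9]+"

# gsub() replaces every match in each element of the character vector
remove_email <- function(x) gsub(email_pattern, "", x, perl = TRUE)

cleaned <- remove_email(text)
print(cleaned)
```

Because `remove_email` takes and returns a character vector, it can be passed to `tm_map()` through `content_transformer()` in the same way as `RemoveEmail`.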
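Answer 1 (score: 0)
To remove all rows in R that have an invalid email in a particular column:
# != would compare each value with the pattern as a literal string; grepl() performs the regex match
DF <- subset(DF, grepl("^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9.-]+$", Column))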
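Regex-based row filtering in base R goes through `grepl()`, since comparison operators such as `!=` treat the pattern as an ordinary string. The sketch below runs the idea on a small hypothetical data frame; `DF`, its `id` column and the sample values are made up for illustration, and the pattern is only a rough approximation of a valid address.

```r
# Hypothetical data; the column name `Column` follows the answer above
DF <- data.frame(id = 1:4,
                 Column = c("xyz@gmail.com", "not an email", "abcdef@yahoo.co.in", NA),
                 stringsAsFactors = FALSE)

# Anchored with ^ and $ so the whole value has to look like an address
email_pattern <- "^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9.-]+$"

# Keep rows whose value matches; grepl() returns FALSE for NA, so those rows are dropped too
valid <- subset(DF, grepl(email_pattern, Column))

# The opposite filter: keep only the rows that do not look like an address
invalid <- subset(DF, !grepl(email_pattern, Column))

print(valid)
print(invalid)
```

Whether the matching or the non-matching rows should be kept depends on the goal, so both filters are shown.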