gsub function in the tm package used to remove URLs does not remove the entire string

Time: 2016-12-12 21:20:24

Tags: r tm

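I am using the function below in a script built on the R text mining package (tm) to remove URLs from tweets. To my surprise, after cleaning there are leftover "http" words as well as fragments of the URLs themselves (e.g. t.co). It looks as if some URLs are wiped out completely while others are only broken into their components. What could be the reason? Note: I took the "." out of the t.co URLs below, because Stack Overflow does not allow posting URLs to t.co addresses.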

toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "/")
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "@")
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "\\|")
removeURL <- function(x) gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, removeURL)
Text before cleaning:

VOTE TODAY! Go to https://tco/KPQ5EY9VwQ to find your polling location. We are going to Make America Great Again!… https://tco/KPQ5EY9VwQ

Text after cleaning:

vote today go https tco mxraxyntjy find polling location going make america great https tco kpqeyvwq

1 answer:

Answer 0 (score: 7)

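You are removing the symbols that the removeURL function looks for: the toSpace calls replace "/" with a space before removeURL runs, so by then the "://" that the pattern requires is gone and the URL has already been split into pieces. You also need to wrap the function in content_transformer() so that tm_map receives a proper transformer. Here is a working example that removes the URLs first and uses a different regular expression for them (it stops at the first whitespace character):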

library(tm)
test<-"VOTE TODAY! Go to https://t.com/KPQ5EY9VwQ to find your polling location. We are going to Make America Great Again!… https://t.com/KPQ5EY9VwQ"

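trumpcorpus1020to1109 <- VCorpus(VectorSource(test))
# Remove URLs first, while "://" and "/" are still intact; \\S+ matches up to the next whitespace
removeURL <- content_transformer(function(x) gsub("(f|ht)tp(s?)://\\S+", "", x, perl=TRUE))
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, removeURL)
# Only then replace the remaining punctuation with spaces
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))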
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "/")
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "@")
trumpcorpus1020to1109 <- tm_map(trumpcorpus1020to1109, toSpace, "\\|")
content(trumpcorpus1020to1109[[1]])
# [1] "VOTE TODAY! Go to  to find your polling location. We are going to Make America Great Again!… "
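
To see the effect of the ordering without tm, here is a minimal base R sketch. The `tweet` string and the variable names are made up for illustration and are not taken from the question's data. It also shows what the question's original pattern does when applied to an intact URL: it only matches up to the last "." followed by lowercase letters, so the path after the domain is left behind.

```r
# Base R only; `tweet` is a hypothetical sample string.
tweet <- "Go to https://t.co/KPQ5EY9VwQ to find your polling location"

url_pattern      <- "(f|ht)tp(s?)://\\S+"           # pattern from the answer
original_pattern <- "(f|ht)tp(s?)://(.*)[.][a-z]+"  # pattern from the question

# Order used in the question: "/" is replaced first, so "://" no longer
# exists and the URL pattern has nothing left to match.
slash_first <- gsub(url_pattern, "", gsub("/", " ", tweet), perl = TRUE)

# Order used in the answer: URLs are removed while they are still intact.
url_first <- gsub("/", " ", gsub(url_pattern, "", tweet, perl = TRUE))

# The question's pattern on an intact URL: it stops after ".co" and
# leaves the path behind.
original_result <- gsub(original_pattern, "", tweet)

print(slash_first)
print(url_first)
print(original_result)
```

With this input, `slash_first` still contains "https" and the URL fragments, `url_first` has the whole URL removed, and `original_result` keeps the "/KPQ5EY9VwQ" path.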