Question

I have tweet text like this in R:

"RT @SportClipsUT125: #SavingLivesLooksGood with #RedCross. Donate this month &amp; Get free haircut cpn. https://somewebsite https://somewebsite…"

How can I remove all the links (so that I can then remove duplicate tweets), so that the tweet above returns the string below?

"RT @SportClipsUT125: #SavingLivesLooksGood with #RedCross. Donate this month &amp; Get free haircut"

I tried this:

gsub('https*','',test_str)

But it returns:

"RT @SportClipsUT125: #SavingLivesLooksGood with #RedCross. Donate this month &amp; Get free haircut cpn. ://somewebsite ://somewebsite…"

Answer 1

A simple solution is to change your gsub command:

~~gsub("http[s]*://[[:alnum:]]*", "", test_str) will correctly remove the URLs, both the http and https versions.~~

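@alistaire's suggestion in the comments works in more cases and is easier to understand: gsub('http\\S*', "", test_str) removes anything that starts with http. The match stops at the first whitespace character, which a URL does not contain.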
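To see why the original pattern left `://somewebsite` behind, here is a minimal sketch in base R. The string `sample_str` is a shortened, ASCII-only stand-in for the tweet in the question (an assumption made for illustration), and the call to `trimws()` is an addition that drops the spaces left over once the URLs are gone.

```r
# Shortened, ASCII-only stand-in for the tweet in the question (illustrative)
sample_str <- "Donate this month &amp; Get free haircut cpn. https://somewebsite https://somewebsite"

# 'https*' means "http" followed by zero or more "s", so only the scheme
# name is matched and the rest of each URL is left in place.
broken <- gsub("https*", "", sample_str)
print(broken)
# "Donate this month &amp; Get free haircut cpn. ://somewebsite ://somewebsite"

# 'http\\S*' means "http" followed by any run of non-whitespace characters,
# so the whole URL is consumed up to the next space.
fixed <- trimws(gsub("http\\S*", "", sample_str))
print(fixed)
# "Donate this month &amp; Get free haircut cpn."
```

Note that `http\\S*` also removes any ordinary word that happens to start with "http", which is rarely a problem for tweet text but worth keeping in mind.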

gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", test_str) removes retweet markers (RT or via followed by one or more @user names).

gsub("@\\w+", "", test_str) removes @mentions of people.
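The substitutions above can be chained into one helper and followed by `unique()` to drop the duplicates the question mentions. This is only a sketch: the function name `clean_tweet`, its `strip_mentions` argument, the whitespace cleanup, and the two sample tweets (including their URL paths) are assumptions made for illustration and are not part of the original answer.

```r
# Hypothetical helper that chains the gsub() calls from this answer.
clean_tweet <- function(x, strip_mentions = TRUE) {
  x <- gsub("http\\S*", "", x)                        # remove URLs
  if (strip_mentions) {
    x <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", x)   # remove retweet markers
    x <- gsub("@\\w+", "", x)                         # remove remaining @mentions
  }
  x <- gsub("\\s+", " ", x)                           # collapse runs of whitespace
  trimws(x)                                           # drop leading/trailing spaces
}

# Two invented tweets that differ only in their link
tweets <- c(
  "RT @SportClipsUT125: Get free haircut cpn. https://somewebsite/1",
  "RT @SportClipsUT125: Get free haircut cpn. https://somewebsite/2"
)

cleaned <- clean_tweet(tweets)
unique_tweets <- unique(cleaned)   # identical text collapses to one entry

# Keeping the "RT @user:" prefix, as in the output the question asks for
kept <- clean_tweet(tweets[1], strip_mentions = FALSE)
print(kept)
# "RT @SportClipsUT125: Get free haircut cpn."
```

Because the desired output in the question keeps the `RT @SportClipsUT125:` prefix, `strip_mentions = FALSE` is probably the closer match to what was asked; the default applies all three substitutions from this answer.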

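I strongly recommend putting your data into a corpus (a special data format), which makes it easy to remove frequently repeated words and URLs. Once you have a corpus of your data, you can do this: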

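library(tm)
# Build a corpus from a character vector of tweets
corpus <- Corpus(VectorSource(my_data))
# Re-encode to UTF-8, replacing non-convertible bytes with their hex codes
corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, to = "UTF-8", sub = "byte")))
# Strip everything from "http" up to the next whitespace character
removeURL <- function(x) gsub("http\\S*", "", x)
corpus <- tm_map(corpus, content_transformer(removeURL))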

A great link with examples of how to do all of this: Text Mining Guide on Rpubs
