Remove all punctuation from text, including apostrophes, for use with the tm package

Asked: 2018-11-20 12:14:23

Tags: r text-mining tm

I have a vector of tweets (message text only) that I am cleaning for text mining. I used removePunctuation from the tm package, like so:

clean_tweet_text = removePunctuation(tweet_text)

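This gives a vector with all punctuation removed from the text except apostrophes, which breaks my keyword searches because words attached to an apostrophe are not registered. For example, one of my keywords is climate, but a tweet containing 'climate is not counted.

How can I remove all apostrophes/single quotes from the vector?

Here is the dput() output of the head of the data, as a reproducible example: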

c("expert briefing on climatechange disarmament sdgs nmun httpstco5gqkngpkap", 
"who uses nasa earth science data he looks at impact of aerosols on climateamp weather httpstcof4azsiqkw1 https…", 
"rt oddly enough some republicans think climate change is real oddly enough… httpstcomtlfx1mnuf uniteblue https…", 
"better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok", 
"i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok", 
"why go for ecosystem basses conservation climatechange raajje maldives ecocaremv httpstcorauhjbasyl", 
"ted cruz ‘climate change is not science it’s religion’ httpstco0qqtbofe0h via glennbeck", 
"unusual warming kills gulf of maine cod  discovery news globalwarming  httpstco39uvock3xe", 
"this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc", 
"what do the remaining republican candidates have to say about climate change fixgov httpstcoxpszwbrcnh httpstcodgqyidkw6o"
)

1 answer:

Answer 0: (score: 4)

To remove all punctuation (including apostrophes and single quotes), you can simply use gsub():

x <- c("expert briefing on climatechange disarmament sdgs nmun httpstco5gqkngpkap",
       "who uses nasa earth science data he looks at impact of aerosols on climateamp weather httpstcof4azsiqkw1 https…",
       "rt oddly enough some republicans think climate change is real oddly enough… httpstcomtlfx1mnuf uniteblue https…",
       "better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok",
       "i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok",
       "why go for ecosystem basses conservation climatechange raajje maldives ecocaremv httpstcorauhjbasyl",
       "ted cruz ‘climate change is not science it’s religion’ httpstco0qqtbofe0h via glennbeck",
       "unusual warming kills gulf of maine cod discovery news globalwarming httpstco39uvock3xe",
       "this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc",
       "what do the remaining republican candidates have to say about climate change fixgov httpstcoxpszwbrcnh httpstcodgqyidkw6o")

gsub("[[:punct:]]", "", x)
#>  [1] "expert briefing on climatechange disarmament sdgs nmun httpstco5gqkngpkap"                                                
#>  [2] "who uses nasa earth science data he looks at impact of aerosols on climateamp weather httpstcof4azsiqkw1 https"           
#>  [3] "rt oddly enough some republicans think climate change is real oddly enough httpstcomtlfx1mnuf uniteblue https"            
#>  [4] "better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"              
#>  [5] "i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"                  
#>  [6] "why go for ecosystem basses conservation climatechange raajje maldives ecocaremv httpstcorauhjbasyl"                      
#>  [7] "ted cruz climate change is not science its religion httpstco0qqtbofe0h via glennbeck"                                     
#>  [8] "unusual warming kills gulf of maine cod discovery news globalwarming httpstco39uvock3xe"                                  
#>  [9] "this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc"       
#> [10] "what do the remaining republican candidates have to say about climate change fixgov httpstcoxpszwbrcnh httpstcodgqyidkw6o"

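Created on 2018-11-20 by the reprex package (v0.2.1)

gsub() replaces every occurrence of its first argument with its second argument within its third argument (see help("gsub")). Here, that means it replaces every occurrence of any character in the set [[:punct:]] in the vector x with "" (that is, it deletes them).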
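As a minimal illustration of that argument order (the vector `words` is made up for this sketch and is not part of the question's data):

```r
# gsub(pattern, replacement, x): pattern first, replacement second, input third.
words <- c("it's", "'climate", "end.")   # hypothetical sample input

# Every character in the [[:punct:]] class is replaced by "", i.e. deleted.
cleaned <- gsub("[[:punct:]]", "", words)
print(cleaned)
#> [1] "its"     "climate" "end"
```

Because gsub() is vectorised over its third argument, the same call works unchanged on the full vector of tweets.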

Which characters does this remove? From help("regex"):

[:punct:]

Punctuation characters:
  ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
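One way to check this list on your own setup is to test each character against the class. This is only a sketch: the names `ascii_punct` and `is_punct` are introduced here for illustration, and it covers only the ASCII characters that help("regex") lists.

```r
# The 32 ASCII punctuation characters listed in help("regex").
ascii_punct <- strsplit("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~", "")[[1]]

# TRUE for each character that the [[:punct:]] class matches.
is_punct <- grepl("[[:punct:]]", ascii_punct)
all(is_punct)

# Letters, digits and spaces are not in the class, so gsub() leaves them alone.
not_punct <- grepl("[[:punct:]]", c("a", "7", " "))
any(not_punct)
```

Whether non-ASCII marks such as typographic quotes or an ellipsis character are also matched can depend on the regex engine and the locale, so it is worth running a check like this on a sample of your own data.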

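Update

This appears to happen because the apostrophes in your data are typographic (curly) quotes such as ’ rather than the plain ASCII '. So, if you want to stick with tm::removePunctuation(), you can also use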

tm::removePunctuation(x, ucp = TRUE)
#>  [1] "expert briefing on climatechange disarmament sdgs nmun httpstco5gqkngpkap"                                                
#>  [2] "who uses nasa earth science data he looks at impact of aerosols on climateamp weather httpstcof4azsiqkw1 https"           
#>  [3] "rt oddly enough some republicans think climate change is real oddly enough httpstcomtlfx1mnuf uniteblue https"            
#>  [4] "better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"              
#>  [5] "i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"                  
#>  [6] "why go for ecosystem basses conservation climatechange raajje maldives ecocaremv httpstcorauhjbasyl"                      
#>  [7] "ted cruz climate change is not science its religion httpstco0qqtbofe0h via glennbeck"                                     
#>  [8] "unusual warming kills gulf of maine cod discovery news globalwarming httpstco39uvock3xe"                                  
#>  [9] "this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc"       
#> [10] "what do the remaining republican candidates have to say about climate change fixgov httpstcoxpszwbrcnh httpstcodgqyidkw6o"
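If you only want to strip apostrophes and single quotes and keep other punctuation, a narrower character class is one option. The sketch below is an assumption-laden illustration rather than part of the original answer: it rebuilds tweet 7 with Unicode escapes (U+2018 and U+2019 for the curly quotes), and the whitespace tokenisation is just a stand-in for whatever keyword search you actually run.

```r
# Tweet 7 from the example data, with the curly quotes written as escapes.
tweet <- "ted cruz \u2018climate change is not science it\u2019s religion\u2019 httpstco0qqtbofe0h via glennbeck"

# Before cleaning, the keyword is glued to the opening quote.
tokens_before <- strsplit(tweet, " ", fixed = TRUE)[[1]]
found_before <- "climate" %in% tokens_before   # FALSE

# Remove left/right single curly quotes and the ASCII apostrophe only.
cleaned <- gsub("[\u2018\u2019']", "", tweet)

# After cleaning, "climate" is a token of its own.
tokens_after <- strsplit(cleaned, " ", fixed = TRUE)[[1]]
found_after <- "climate" %in% tokens_after     # TRUE
```

Writing the quotes as \u escapes keeps the script independent of the source file's encoding; extend the class if your data contains other quote-like characters.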