Remove text other than normal English letters

Time: 2014-04-14 04:17:29

Tags: r

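I have a set of tweet data that I believe contains languages other than English, such as Thai, which show up as symbols like "â", "ã", "Ø" and so on. How can I remove everything other than normal letters?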

structure(list(text = structure(c(4L, 6L, 1L, 2L, 5L, 3L), .Label = c("ãããæããããéãããæãããInappropriate announce", 
"@AirAsia your direct debit (Maybank) payment gateways is not working. Is it something you are working to fix?", 
"Bulls â äè æèäååééäéåèååäæäåççæäæææåãTravel sixtyãåääçæç ïçäèåéè #MH çæäæääéæã", 
"RT @AirAsia: ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í Now you can enjoy a #great breakfast onboard with our new breakfast meals! http://t.co/957ZaLjY…", 
"xdek ke flight @AirAsia Malaysia to LA... hahah..bagi la promo murah2 sikit, kompom aku beli...", 
"You know there is a problem when customer service asks you to wait for 103 minutes and your no is 42 in the queue. #AirAsia"
), class = "factor"), created = structure(c(5L, 4L, 4L, 3L, 2L, 
1L), .Label = c("1/2/2014 16:14", "1/2/2014 17:00", "3/2/2014 0:54", 
"3/2/2014 0:58", "3/2/2014 1:28"), class = "factor")), .Names = c("text", 
"created"), class = "data.frame", row.names = c(NA, -6L))

1 answer:

Answer 0 (score: 0)

This seems to work for the most part:

gsub("[^a-zA-Z0-9 [:punct:]]","",dat$text)
  

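[1] "RT @AirAsia: Now you can enjoy a #great breakfast onboard with our new breakfast meals! http://t.co/957ZaLjY…"
[2] "You know there is a problem when customer service asks you to wait for 103 minutes and your no is 42 in the queue. #AirAsia"
[3] "Inappropriate announce"
[4] "@AirAsia your direct debit (Maybank) payment gateways is not working. Is it something you are working to fix?"
[5] "xdek ke flight @AirAsia Malaysia to LA... hahah..bagi la promo murah2 sikit, kompom aku beli..."
[6] "Bulls Travel sixty #MH"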
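
The result above can still contain stray characters, because the meaning of `[:punct:]` depends on the locale, and removing characters leaves runs of spaces behind. A minimal base R sketch of a stricter variant follows: it keeps only printable ASCII and then tidies the whitespace. The helper name `strip_non_ascii` and the sample strings are hypothetical stand-ins, not the asker's data, and this is one possible approach rather than the answer's method.

```r
# Hypothetical sample input (stand-ins for the tweets in the question).
x <- c("caf\u00e9 Inappropriate announce",
       "Bulls \u00e2 \u00e4\u00e8 Travel sixty #MH",
       "plain ASCII tweet @AirAsia")

# Hypothetical helper: keep printable ASCII only, then tidy whitespace.
strip_non_ascii <- function(s) {
  s <- as.character(s)                             # factor columns become character
  s <- gsub("[^\\x20-\\x7E]", "", s, perl = TRUE)  # drop anything outside printable ASCII
  s <- gsub(" +", " ", s)                          # collapse runs of spaces left behind
  trimws(s)                                        # strip leading and trailing spaces
}

cleaned <- strip_non_ascii(x)
print(cleaned)
```

With the data frame from the question, the same helper would be called as `strip_non_ascii(dat$text)`. Note that this also removes legitimate accented letters and characters such as "…", so it only suits cases where plain ASCII output is acceptable.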