我有一组推文数据,我认为它包含除英语以外的一些语言,例如泰语,它们用符号写成,如“â”,“ã”,“Ø”等。如何删除正常字母以外的其他措辞?
structure(list(text = structure(c(4L, 6L, 1L, 2L, 5L, 3L), .Label = c("ãããæããããéãããæãããInappropriate announce",
"@AirAsia your direct debit (Maybank) payment gateways is not working. Is it something you are working to fix?",
"Bulls â äè æèäååééäéåèååäæäåççæäæææåãTravel sixtyãåääçæç ïçäèåéè #MH çæäæääéæã",
"RT @AirAsia: ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í Now you can enjoy a #great breakfast onboard with our new breakfast meals! http://t.co/957ZaLjY…",
"xdek ke flight @AirAsia Malaysia to LA... hahah..bagi la promo murah2 sikit, kompom aku beli...",
"You know there is a problem when customer service asks you to wait for 103 minutes and your no is 42 in the queue. #AirAsia"
), class = "factor"), created = structure(c(5L, 4L, 4L, 3L, 2L,
1L), .Label = c("1/2/2014 16:14", "1/2/2014 17:00", "3/2/2014 0:54",
"3/2/2014 0:58", "3/2/2014 1:28"), class = "factor")), .Names = c("text",
"created"), class = "data.frame", row.names = c(NA, -6L))
答案 0 :(得分:0)
这似乎在很大程度上起作用:
gsub("[^a-zA-Z0-9 [:punct:]]","",dat$text)
[1]“RT @AirAsia:现在您可以在船上享用#great早餐 我们的新早餐! http://t.co/957ZaLjY |“
[2]“当客户服务要求您等待103时,您知道存在问题 分钟,你的号码是42号。 #AirAsia“
[3]“不适当的宣布”
[4]“@AirAsia您的直接借记(Maybank)支付网关不是 工作。这是你正在努力解决的问题吗?“ [5]“xdek ke flight @AirAsia Malaysia to LA ... hahah..bagi la promo murah2 sikit,kompom aku beli ...“
[6]“公牛队旅行六十#MH”