I am trying to parse some tweet text into keywords using the regular expression functions in R. I have the following code.
sentence = gsub("[[:punct:]]", "", sentence)
sentence = gsub("[[:cntrl:]]", "", sentence)
sentence = gsub("\\d+", "", sentence)
sentence = tolower(sentence)
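However, one of my sentences contains the sequence "\ud83d\udc4b". Parsing this sequence fails (the error is "invalid input in 'utf8towcs'"). I want to replace such sequences with "". I tried the regular expression "\u+", but it does not match. What regular expression should I use to match this sequence? Thanks.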
Answer 0 (score: 4)
I think you want something like this:
> s <- "\ud83d\udc4b Delta"
> Encoding(s)
[1] "UTF-8"
> iconv(s, "ASCII", sub="")
[1] " Delta"
> f <- iconv(s, "ASCII", sub="")
> sentence = tolower(f)
> sentence
[1] " delta"
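The conversion above can be combined with the cleanup steps from the question. The following is only a sketch: `clean_tweet` is a hypothetical helper name, and it assumes the input is UTF-8 and that dropping every non-ASCII character (emoji, accented letters, and so on) is acceptable. The exact behaviour of `sub` can vary between platforms' iconv implementations.

```r
# Hypothetical helper (not from the original answer): strip non-ASCII
# characters first, then run the question's cleanup steps.
clean_tweet <- function(sentence) {
  # Bytes that cannot be represented in ASCII are replaced by sub = "".
  # Assumption: the input is UTF-8 encoded.
  sentence <- iconv(sentence, from = "UTF-8", to = "ASCII", sub = "")
  sentence <- gsub("[[:punct:]]", "", sentence)  # remove punctuation
  sentence <- gsub("[[:cntrl:]]", "", sentence)  # remove control characters
  sentence <- gsub("\\d+", "", sentence)         # remove runs of digits
  tolower(sentence)
}

cleaned <- clean_tweet("Hello, World 123!")
```

Running `iconv` before the `gsub` calls means the regular expression functions only ever see ASCII input, which is what avoids the 'utf8towcs' error in this approach.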
Answer 1 (score: 0)
> sentence = RemoveNotASCII(sentence)
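The following function removes non-ASCII characters. Note that it expects a data frame rather than a bare character vector.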
RemoveNotASCII <- function(x) {
  # Remove all non-ASCII characters, column by column, from a data frame.
  # x: a data frame; every column is converted to character.
  n <- ncol(x)
  z <- vector("list", n)
  for (j in seq_len(n)) {
    y <- as.character(x[, j])
    # Treat the bytes as latin1, then drop every byte that has
    # no ASCII equivalent.
    Encoding(y) <- "latin1"
    y <- iconv(y, "latin1", "ASCII", sub = "")
    z[[j]] <- y
  }
  z <- do.call("cbind.data.frame", z)
  names(z) <- names(x)
  # Data frame with non-ASCII characters removed
  return(z)
}
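Since the question works with a character vector rather than a data frame, a vector-based variant may be easier to apply. This is a minimal sketch, not part of the original answer: `remove_non_ascii_vec` and the sample data frame `df` are hypothetical names, and the function reuses the same latin1-to-ASCII conversion as above.

```r
# Hypothetical vector-based variant of the function above.
remove_non_ascii_vec <- function(x) {
  x <- as.character(x)
  # Read the bytes as latin1 and drop those with no ASCII equivalent.
  iconv(x, from = "latin1", to = "ASCII", sub = "")
}

# Works directly on a character vector ...
sentence <- remove_non_ascii_vec(c("caf\u00e9", "Delta"))

# ... and can still be applied to every column of a data frame.
df <- data.frame(text = c("caf\u00e9", "Delta"), stringsAsFactors = FALSE)
df[] <- lapply(df, remove_non_ascii_vec)
```

Assigning into `df[]` keeps the data frame structure and column names, so no explicit `cbind` step is needed.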
Answer 2 (score: 0)
The qdapRegex package has the rm_non_ascii function to handle this:
library(qdapRegex)
tolower(rm_non_ascii(s))
## [1] "delta"