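R regex to match Twitter patterns

Time: 2014-09-08 07:04:00

Tags: regex r

I am trying to use the regular expression functions in R to parse some tweet text into keywords. I have the following code.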

sentence = gsub("[[:punct:]]", "", sentence)
sentence = gsub("[[:cntrl:]]", "", sentence)
sentence = gsub("\\d+", "", sentence)
sentence = tolower(sentence)

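However, one of my sentences contains the sequence "\ud83d\udc4b". Parsing fails on this sequence (the error is "invalid input in 'utf8towcs'"). I would like to replace such sequences with "". I tried substituting with the regex "\u+", but that does not match. What regex should I use to match this sequence? Thanks.

3 answers:

Answer 0 (score: 4)

I think you want something like this: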

> s <- "\ud83d\udc4b Delta"
> Encoding(s)
[1] "UTF-8"
> iconv(s, "ASCII", sub="")
[1] " Delta"
> f <- iconv(s, "ASCII", sub="")
> sentence = tolower(f)
> sentence
[1] " delta"
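
To connect this with the cleaning steps in the question, here is a minimal sketch that runs the `iconv` step first, so that the later `gsub` and `tolower` calls only ever see ASCII text. The helper name `clean_tweet` and the sample tweet are hypothetical, and the sketch assumes the input is UTF-8 and that the platform's `iconv` honours `sub = ""`; it is one possible arrangement, not necessarily what the answer's author had in mind.

```r
# Hypothetical helper (not from the original answer): drop everything that
# cannot be represented in ASCII, then apply the steps from the question.
clean_tweet <- function(sentence) {
  # Assumption: input is UTF-8; bytes with no ASCII equivalent are replaced by "".
  sentence <- iconv(sentence, from = "UTF-8", to = "ASCII", sub = "")
  sentence <- gsub("[[:punct:]]", "", sentence)  # strip punctuation
  sentence <- gsub("[[:cntrl:]]", "", sentence)  # strip control characters
  sentence <- gsub("\\d+", "", sentence)         # strip digits
  tolower(sentence)
}

# Made-up example tweet containing an emoji, punctuation, and digits.
clean_tweet("\U0001F44B Delta, flight 42!")
```

Doing the conversion before `tolower` matters here because, according to the question, the failure is triggered by the non-ASCII sequence itself; once it is gone, the remaining steps are unchanged.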

Answer 1 (score: 0)

> sentence = RemoveNotASCII(sentence)

Below is the function that removes non-ASCII characters. It expects a data frame and works column by column.

RemoveNotASCII <- function#Remove all non ASCII characters
### remove column by columns non ASCII characters from a dataframe
(
  x ##<< dataframe
){
  n <- ncol(x)
  z <- list()
  for (j in 1:n) {
    y = as.character(x[,j])
    if (class(y)=="character") {
      Encoding(y) <- "latin1"
      y <- iconv(y, "latin1", "ASCII", sub="")
    }
    z[[j]] <- y
  }
  z = do.call("cbind.data.frame", z)
  names(z) <- names(x)
  return(z)
  ### Dataframe with non ASCII characters removed
}
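
The function above is documented as taking a data frame. If `sentence` is a plain character vector rather than a data frame, the same latin1-to-ASCII conversion can be applied to it directly. The following is a minimal sketch under that assumption; the name `remove_non_ascii_vec` and the sample strings are hypothetical and not part of the answer.

```r
# Hypothetical vector variant of the conversion used in RemoveNotASCII.
remove_non_ascii_vec <- function(x) {
  x <- as.character(x)
  # Read the bytes as latin1 and drop every byte that has no ASCII equivalent.
  iconv(x, from = "latin1", to = "ASCII", sub = "")
}

# Made-up input: one string with an accented character, one pure ASCII.
remove_non_ascii_vec(c("caf\u00e9 Delta", "plain text"))
```

Reading the input as latin1 means every byte is a valid input character, so the conversion does not fail on malformed UTF-8; the trade-off is that any non-ASCII character is dropped entirely rather than transliterated.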

Answer 2 (score: 0)

The qdapRegex package has the rm_non_ascii function to handle this:

library(qdapRegex)
tolower(rm_non_ascii(s))

## [1] "delta"