Question

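I'm working with Twitter data in R and trying to remove all correctly spelled English words from tweets. The idea is to look at the colloquial abbreviations, misspellings, and slang used by a particular group of people whose tweets I have recorded.

Example: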

    tweet <- c("Trying to find the solution frustrated af")

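After the operation above, I'd like to be left with only 'af'.

I thought of scrubbing the tweets against a dictionary (which I would download), but there must be a simpler option. Any solution in Python would also help.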

Answer 1

Another hunspell-based solution, using a fairly new and interesting package:

    # install.packages("hunspell") # uncomment & run if needed
    library(hunspell)
    tweet <- c("Trying to find the solution frustrated af")
    # Split the tweet on spaces; the outer parentheses print the result
    ( tokens <- strsplit(tweet, " ")[[1]] )
    # [1] "Trying"     "to"         "find"       "the"        "solution"   "frustrated" "af"
    # Keep only the tokens that hunspell does not find in the en_US dictionary
    tokens[!hunspell_check(tokens, dict = "en_US")]
    # [1] "af"
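
For more than one tweet, or for tweets that contain punctuation, the same check can be wrapped in a small helper. This is only a sketch: the function name `non_dictionary_words`, the sample tweets, and the punctuation-stripping rule are assumptions made for illustration and are not part of hunspell. Real tweets (hashtags, mentions, URLs, emoji) will probably need more careful tokenizing.

```r
library(hunspell)

# Hypothetical sample tweets; replace with your own character vector.
tweets <- c("Trying to find the solution frustrated af",
            "This is gr8, thx!")

# Return the tokens of a single tweet that hunspell does not recognise.
# `non_dictionary_words` is an illustrative helper, not a hunspell function.
non_dictionary_words <- function(tweet, dict = "en_US") {
  # Split on runs of whitespace
  tokens <- strsplit(tweet, "\\s+")[[1]]
  # Strip leading and trailing punctuation so "thx!" is checked as "thx"
  tokens <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", tokens)
  # Drop tokens that became empty after stripping
  tokens <- tokens[nzchar(tokens)]
  # Keep only the tokens that fail the spell check
  tokens[!hunspell_check(tokens, dict = dict)]
}

# One character vector of unrecognised tokens per tweet
lapply(tweets, non_dictionary_words)
```

Which tokens are returned depends on the installed dictionary, so it is worth inspecting the output on a sample before running it over the full data set.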
