Emoji encoding in R

Asked: 2017-10-27 11:59:19

Tags: r encoding utf-8 emoji iso-8859-1

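This is my first time working with text that contains emoji characters, so my question may be very basic, but I have not found a solution yet.

I exported a .txt file from WhatsApp on Android and sent it to my computer (Windows). The data looks like this: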

chat <- c("05.10.17, 22:55 - Person A: Hey, whats up? 😳😄","05.10.17, 22:55 - Person A: I heard about your problem 😅🙄😂","05.10.17, 22:56 - Person B: What? From whom?🙈","05.10.17, 22:57 - Person A: Your mom...","05.10.17, 22:59 - Person B: ðŸ˜ðŸ˜„","05.10.17, 22:59 - Person B: ðŸ˜ðŸ˜„")
chat

[1] "05.10.17, 22:55 - Person A: Hey, whats up? 😳😄"                
[2] "05.10.17, 22:55 - Person A: I heard about your problem 😅🙄😂"
[3] "05.10.17, 22:56 - Person B: What? From whom?🙈"                   
[4] "05.10.17, 22:57 - Person A: Your mom..."                            
[5] "05.10.17, 22:59 - Person B: ðŸ˜ðŸ˜„"    
[6] "05.10.17, 22:59 - Person B: ðŸ˜ðŸ˜„"
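
Elements 5 and 6 above show garbled byte sequences where emoji were expected, which can happen when UTF-8 bytes are interpreted in a single-byte encoding such as the Windows default. A minimal sketch of reading a file with an explicit UTF-8 declaration, assuming the export really is UTF-8; the temporary file and the sample line are stand-ins, not the real export:

```r
# Hypothetical stand-in for the exported WhatsApp file (not the real export).
path <- tempfile(fileext = ".txt")
sample_line <- "05.10.17, 22:55 - Person A: Hey, whats up? \U0001F633\U0001F604"
writeLines(enc2utf8(sample_line), path, useBytes = TRUE)  # write the UTF-8 bytes as-is

# Read the lines and mark them as UTF-8 instead of the native encoding.
chat_utf8 <- readLines(path, encoding = "UTF-8", warn = FALSE)

Encoding(chat_utf8)   # encoding mark R attached to each line
validUTF8(chat_utf8)  # whether the bytes form valid UTF-8
```

If `validUTF8()` returns `FALSE` for the real file, the export is probably not UTF-8 and this sketch does not apply as written.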

What I want to do is replace the emoji characters with an easy-to-read text representation, so I would like something like this:

[1] "05.10.17, 22:55 - Person A: Hey, whats up? [[SMILEY1]][[SMILEY2]]"                
[2] "05.10.17, 22:55 - Person A: I heard about your problem [[SMILEY2]][[SMILEY3]][[SMILEY2]]"
[3] "05.10.17, 22:56 - Person B: What? From whom?[[SMILEY3]]"                   
[4] "05.10.17, 22:57 - Person A: Your mom..."                            
[5] "05.10.17, 22:59 - Person B: [[SMILEY2]][[SMILEY2]]"    
[6] "05.10.17, 22:59 - Person B: [[SMILEY2]][[SMILEY2]]"

I found a dictionary online that maps emojis to text descriptions, so I imported it into R:

# Import the list of WhatsApp emojis together with their descriptions
Emojis <- read.csv(url("https://raw.githubusercontent.com/iorch/jakaton_feminicidios/master/data/emojis.csv"), header = TRUE, encoding = "UTF-8", stringsAsFactors = FALSE)
Emojis

# Wrap each description in a marker so it stands out in the text later on
Emojis[, 2] <- paste("[[Emoji:", Emojis[, 2], "]]")
Emojis

Now I try to replace every match of Emojis[,1] in chat with the corresponding Emojis[,2]:

library(qdapRegex)

CleanMessage <- chat

# For each dictionary row, replace the emoji (column 1) with its description (column 2)
for (i in seq_along(Emojis[, 1])) {
  CleanMessage <- lapply(CleanMessage, rm_default, clean = TRUE,
                         pattern = Emojis[i, 1], replacement = Emojis[i, 2])
}

CleanMessage

However, my output looks exactly the same... Can anyone point out my mistake?
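
For comparison, here is a minimal base-R sketch of the same dictionary replacement. It is not a confirmed diagnosis of the problem above. It assumes both the messages and the dictionary hold correctly encoded UTF-8 emoji; the two-row dictionary, its column names `emoji` and `label`, and the sample messages are hypothetical stand-ins for the imported CSV and the chat:

```r
# Hypothetical two-row dictionary standing in for the imported CSV;
# the column names `emoji` and `label` are assumptions.
dict <- data.frame(
  emoji = c("\U0001F633", "\U0001F604"),
  label = c("[[SMILEY1]]", "[[SMILEY2]]"),
  stringsAsFactors = FALSE
)

# Hypothetical messages with the timestamp and sender stripped for brevity.
msgs <- c("Hey, whats up? \U0001F633\U0001F604", "Your mom...")

clean <- msgs
for (i in seq_len(nrow(dict))) {
  # fixed = TRUE matches the emoji as a literal string, not as a regex
  clean <- gsub(dict$emoji[i], dict$label[i], clean, fixed = TRUE)
}
clean
```

If this works on clean input but the real data stays unchanged, the emoji in the chat and in the dictionary may not be stored as the same byte sequences. That would point to an encoding mismatch, not to the replacement loop.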

0 answers:

No answers yet.