Correctly reading Unicode emoji into R

Date: 2017-12-06 13:53:51

Tags: r text unicode utf-8 emoji

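I have a set of comments from Facebook (pulled via a system like Sprinkr) that contain both text and emoji. I'm trying to run various analyses on them in R, but I'm having trouble ingesting the emoji characters correctly.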

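For example: I have a .csv (encoded in UTF-8) whose message rows contain something like this:

"IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups💚💚💚"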

I then ingest it into R as follows:

library(tidyverse)
library(janitor)
raw.fb.comments <- read_csv("data.csv",
                            locale = locale(encoding="UTF-8"))
fb.comments <- raw.fb.comments %>%
  clean_names() %>%
  filter(senderscreenname != "Reese's") %>% 
  select(c(message,messagetype,sentiment)) %>%
  mutate(type = "Facebook")
fb.comments$message[5]
[1] "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups\xf0\u009f\u0092\u009a\xf0\u009f\u0092\u009a\xf0\u009f\u0092\u009a\n\n"

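Now, from what I understand from other sources, I need to convert this UTF-8 to ASCII, which I can then use to link it with other emoji resources (such as the wonderful emojidictionary). For the join to work, I need to get it into the R encoding, which looks like this: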

<e2><9d><a4><ef><b8><8f>

However, adding the usual step (using iconv) doesn't get me there:

fb.comments <- raw.fb.comments %>%
  clean_names() %>%
  filter(senderscreenname != "Reese's") %>% 
  select(c(message,messagetype,sentiment)) %>%
  mutate(type = "Facebook") %>%
  mutate(message = iconv(message, from="UTF-8", to="ascii",sub="byte"))
fb.comments$message[5]
[1] "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups<f0><9f><92><9a><f0><9f><92><9a><f0><9f><92><9a>\n\n"

Can anyone out there tell me what I'm missing, or whether I need to find a different emoji mapping resource? Thanks!
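
As a reference point for the byte notation discussed above, here is a minimal sketch in base R, assuming a UTF-8 string that holds a single green heart (U+1F49A). It shows the `<xx>` form that `iconv(..., sub = "byte")` produces and one way to turn that form back into code points. The helper name `bytes_to_codepoints` is made up for illustration.

```r
# A single green heart, written as an escape so the script stays ASCII-only.
heart <- "\U0001F49A"

# With sub = "byte", iconv() replaces each non-ASCII byte with its <xx> hex form.
escaped <- iconv(heart, from = "UTF-8", to = "ASCII", sub = "byte")

# Hypothetical helper: turn a run of <xx> tokens back into Unicode code points.
bytes_to_codepoints <- function(s) {
  # Pull out the two hex digits inside each <...> token.
  hex <- regmatches(s, gregexpr("(?<=<)[0-9a-f]{2}(?=>)", s, perl = TRUE))[[1]]
  txt <- rawToChar(as.raw(strtoi(hex, base = 16L)))
  Encoding(txt) <- "UTF-8"  # declare the reassembled bytes as UTF-8
  utf8ToInt(txt)
}

codepoints <- bytes_to_codepoints(escaped)                      # U+1F49A
heart_vs   <- bytes_to_codepoints("<e2><9d><a4><ef><b8><8f>")   # U+2764 U+FE0F
```

Here `escaped` is `<f0><9f><92><9a>`, the same tokens that appear in the `iconv` output in the question. Whether a given dictionary lists the green heart in exactly that form is something to check against the dictionary itself.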

1 answer:

Answer 0 (score: 2)

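The goal isn't entirely clear, but I suspect that giving up on representing the emoji properly and just representing them as bytes is not the best approach. For example, if you want to convert emoji to their descriptions, you could do something like this: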

library(stringi)   # stri_* functions
library(magrittr)  # %>% pipe

x <- "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups💚💚💚"

## read emoji info and get rid of documentation lines
readLines("https://unicode.org/Public/emoji/5.0/emoji-test.txt",
          encoding="UTF-8") %>%
    stri_subset_regex(pattern = "^[^#]") %>%
    stri_subset_regex(pattern = ".+") -> emoji

## get the emoji characters and clean them up
emoji %>%
    stri_extract_all_regex(pattern = "# *.{1,2} *") %>%
    stri_replace_all_fixed(pattern = c("*", "#"),
                           replacement = "",
                           vectorize_all=FALSE) %>%
    stri_trim_both() -> emoji.chars

## get the emoji character descriptions
emoji %>%
    stri_extract_all_regex(pattern = "#.*$") %>%
    stri_replace_all_regex(pattern = "# *.{1,2} *",
                           replacement = "") %>%
    stri_trim_both() -> emoji.descriptions


## replace emoji characters with their descriptions.
stri_replace_all_regex(x,
                       pattern = emoji.chars,
                       replacement = emoji.descriptions,
                       vectorize_all=FALSE)

## [1] "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cupsgreen heartgreen heartgreen heart"
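
The parsing step can also be sketched offline. The example below assumes the `emoji-test.txt` 5.0 line layout (`code points ; status # emoji name`) and uses three hard-coded sample lines instead of the download. It builds each emoji from the hex code-point field rather than from a `.{1,2}` regex, so multi-code-point entries come through whole. It uses base R only, and `sample_lines`, `parse_emoji_line`, and `replace_emoji` are illustrative names, not part of any package.

```r
# Three lines in the assumed emoji-test.txt 5.0 layout, hard-coded for illustration.
sample_lines <- c(
  "1F49A      ; fully-qualified     # \U0001F49A green heart",
  "2764 FE0F  ; fully-qualified     # \u2764\uFE0F red heart",
  "1F600      ; fully-qualified     # \U0001F600 grinning face"
)

# Hypothetical helper: split one data line into the emoji string and its name.
parse_emoji_line <- function(line) {
  fields <- strsplit(line, ";", fixed = TRUE)[[1]]
  hex <- strsplit(trimws(fields[1]), " +")[[1]]
  # Build the emoji from its code points, e.g. c("2764", "FE0F").
  char <- intToUtf8(strtoi(hex, base = 16L))
  # Text after the first '#': the emoji, a space, then the name.
  comment <- trimws(sub("^[^#]*#", "", line))
  name <- sub("^[^ ]+ +", "", comment)
  list(char = char, name = name)
}

parsed <- lapply(sample_lines, parse_emoji_line)
emoji_chars <- vapply(parsed, function(p) p$char, character(1))
emoji_names <- vapply(parsed, function(p) p$name, character(1))

# Hypothetical helper: swap every known emoji for its name.
replace_emoji <- function(text, chars, names) {
  # Longest sequences first, so a bare code point never pre-empts a longer one.
  for (i in order(nchar(chars), decreasing = TRUE)) {
    text <- gsub(chars[i], names[i], text, fixed = TRUE)
  }
  text
}

x <- "Peanut Butter Cups\U0001F49A\U0001F49A\U0001F49A"
result <- replace_emoji(x, emoji_chars, emoji_names)
```

With the sample table, `result` is `"Peanut Butter Cupsgreen heartgreen heartgreen heart"`. Padding each name with spaces or brackets in `replace_emoji` would keep the descriptions from running into the surrounding text.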