Question

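I'm using Python 2 on Spark (PySpark and Pandas) to analyze data about emoji usage. I have strings like u'u+1f375' or u'u+1f618' that I'd like to convert to 🍵 and 😘 respectively.

I've read several other SO posts and the Unicode HOWTO and tried to get to grips with encode and decode, to no avail.

This didn't work: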

decode_udf = udf(lambda x: x.decode('unicode-escape'))
foo = emojis.withColumn('decoded_emoji', decode_udf(emojis.emoji))
Result: decoded_emoji=u'u+1f618'

This ended up working as a one-off, but it fails when I apply it to my RDD.

def rename_if_emoji(pattern):
  """rename the element name of dataframe with emoji"""

  if pattern.lower().startswith("u+"):
    emoji_string = ""
    EMOJI_PREFIX = "u+"
    for part_org in pattern.lower().split(" "):
      part = part_org.strip();
      if (part.startswith(EMOJI_PREFIX)):
        padding = "0" * (8 + len(EMOJI_PREFIX) - len(part)) 
        codepoint = '\U' + padding + part[len(EMOJI_PREFIX):]
        print("codepoint: " + codepoint)
        emoji_string += codepoint.decode('unicode-escape')
        print("emoji_string: " + emoji_string)
    return emoji_string
  else:
    return pattern

rename_if_emoji_udf = udf(rename_if_emoji)

Error: UnicodeEncodeError: 'ascii' codec can't encode character u'\U0001f618' in position 14: ordinal not in range(128)

Answer 1

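Whether emoji print correctly depends on the IDE or terminal you use. On a terminal that doesn't support them, print raises UnicodeEncodeError, because Python 2 encodes Unicode strings to the terminal's encoding. You also need a font that has the glyphs. Your error comes from the print calls. You are decoding correctly, but ideally your output device should support UTF-8.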
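To see the failure without a terminal involved, here is a minimal sketch that encodes the character explicitly. It assumes the unsupported terminal behaves like the ASCII codec, which is what the error message in the question suggests. The helper name can_encode is made up for illustration. The snippet should run unchanged on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
emoji = u'\U0001f618'  # the character named in the question's error message

def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec (hypothetical helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# ASCII cannot represent the emoji, so printing it to an ASCII terminal fails.
ascii_ok = can_encode(emoji, 'ascii')
# UTF-8 can represent every code point, so the same character encodes fine.
utf8_ok = can_encode(emoji, 'utf-8')
utf8_bytes = emoji.encode('utf-8')
```

If you can't change the terminal, printing repr(t) or the explicitly encoded bytes should sidestep the implicit ASCII encode in Python 2.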

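The example below simplifies the decoding. I also print the repr() of the string, in case the terminal isn't configured to support the characters being printed.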

import re

def replacement(m):
    '''Assume the matched characters are hexadecimal, convert to integer,
       format appropriately, and decode back to Unicode.
    '''
    i = int(m.group(1),16)
    return '\\U{:08X}'.format(i).decode('unicode-escape')

def replace(s):
    '''Replace all u+nnnn strings with the Unicode equivalent.
    '''
    return re.sub(ur'u\+([0-9a-fA-F]+)',replacement,s)

s = u'u+1f618 u+1f375'
t = replace(s)
print repr(t)
print t

Output (on a UTF-8 IDE):

u'\U0001f618 \U0001f375'
😘 🍵
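For anyone on Python 3, where str is already Unicode and chr accepts any code point, the same substitution can be sketched without the unicode-escape round trip. This is a variant of the code above, not what the question's Python 2 setup would run. The names CODEPOINT_RE and to_emoji are illustrative, and the 1 to 6 hex digit limit is my assumption about the input format.

```python
import re

# Matches "u+" or "U+" followed by 1 to 6 hex digits (assumed input format).
CODEPOINT_RE = re.compile(r'[uU]\+([0-9a-fA-F]{1,6})')

def to_emoji(text):
    """Replace each u+XXXX token with the character it names (Python 3 only)."""
    def _sub(match):
        value = int(match.group(1), 16)
        # Leave the token untouched if it is outside the Unicode range.
        if value > 0x10FFFF:
            return match.group(0)
        return chr(value)
    return CODEPOINT_RE.sub(_sub, text)

converted = to_emoji('u+1f618 u+1f375')
```

As with the Python 2 version, checking repr(converted) is a safe way to confirm the result when the terminal can't display the characters.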
