Decoding Unicode escapes into the local language in BigQuery

Asked: 2017-11-07 04:44:31

Tags: javascript google-bigquery decode

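We receive survey webhook data in BigQuery. Comments written in local languages are captured as Unicode escape sequences, and some comments also contain special characters.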

  • Example

    • Comment entered in the survey: "别老是晚点,现场补行李费太贵"
    • Comment as stored in BigQuery: "\u522b\u8001\u662f\u665a\u70b9\uff0c\u73b0\u573a\u8865\u884c\u674e\u8d39\u592a\u8d35"

We found a solution that decodes a single comment:

    CREATE TEMPORARY FUNCTION utf8convert(s STRING)
    RETURNS STRING
    LANGUAGE js AS """
    return unescape( ( s ) );
    """;
    with sample AS (SELECT '\u522b\u8001\u662f\u665a' AS S)
    SELECT utf8convert(s) from sample

When we apply the same code to the Comment field, which holds thousands of comments in different languages, it does not work.

    CREATE TEMPORARY FUNCTION utf8convert(s STRING)
    RETURNS STRING
    LANGUAGE js AS """
    return unescape( ( s ) );
    """;
    SELECT Comment, utf8convert(Comment) as Convert
    FROM `airasia-nps.nps_production.NPSDashboard_Webhook_Data1`
    where Comment is not null

The query runs without errors, but the result is unchanged: the local-language text still appears as Unicode escapes.

  • I also tried this code:

      CREATE TEMP FUNCTION DecodeUnicode(s STRING) AS (
      IF(s NOT LIKE '%\\u%', s,
      (SELECT CODE_POINTS_TO_STRING(ARRAY_AGG(CAST(CONCAT('0x', x) AS INT64)))
      FROM UNNEST(SPLIT(s, '\\u')) AS x
       WHERE x != ''))
      );
    
      SELECT
      original,
      DecodeUnicode(original) AS decoded
      FROM (
      SELECT trim(r'$-\u6599\u91d1\u304c\u9ad8\u3059\u304e\uff01\uff01\uff01') AS original UNION ALL
      SELECT trim(r'abcd')
      );
    

This raises an error. I think it is because the comment starts with special characters?

1 answer:

Answer 0 (score: 1)

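See if this works. It performs a "manual" decode of strings that contain \u: it splits the string on \u, converts each hexadecimal piece to a Unicode code point, and then converts the code points back to a string. It should be faster than using JavaScript.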

    CREATE TEMP FUNCTION DecodeUnicode(s STRING) AS (
      IF(s NOT LIKE '%\\u%', s,
         (SELECT CODE_POINTS_TO_STRING(ARRAY_AGG(CAST(CONCAT('0x', x) AS INT64)))
          FROM UNNEST(SPLIT(s, '\\u')) AS x
          WHERE x != ''))
    );

    SELECT
      original,
      DecodeUnicode(original) AS decoded
    FROM (
      SELECT r'\u522b\u8001\u662f\u665a\u70b9\uff0c\u73b0\u573a\u8865\u884c\u674e\u8d39\u592a\u8d35' AS original UNION ALL
      SELECT r'abcd'
    );

As output, this returns 别老是晚点,现场补行李费太贵 for the first row and abcd for the second.
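
For comments that mix plain text with escapes, such as the `$-\u6599...` sample in the question, splitting on \u leaves pieces like `$-` that cannot be cast to INT64, which is probably the error reported there. Below is a minimal JavaScript sketch of an alternative that rewrites only the \uXXXX sequences and leaves everything else alone. The name `decodeUnicodeEscapes` is hypothetical and not part of the original answer. Note also that JavaScript's `unescape` decodes only `%XX` and `%uXXXX` sequences, so it would leave a literal `\u522b` unchanged.

```javascript
// Hypothetical helper (not from the original answer): replaces every literal
// \uXXXX escape with the UTF-16 code unit it names and leaves all other
// characters, such as a leading "$-", untouched.
function decodeUnicodeEscapes(s) {
  if (s === null || s === undefined) {
    return s; // pass NULL comments through unchanged
  }
  return s.replace(/\\u([0-9a-fA-F]{4})/g, function (match, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

// String.raw keeps the backslashes literal, like the stored comments.
const mixed = String.raw`$-\u6599\u91d1\u304c\u9ad8\u3059\u304e\uff01\uff01\uff01`;
console.log(decodeUnicodeEscapes(mixed));
console.log(decodeUnicodeEscapes('abcd'));
```

To try this in BigQuery, the body of the function could be used as the body of a `LANGUAGE js` UDF such as the `utf8convert` function from the question. Declaring the body as a raw string (`AS r"""..."""`) is likely needed so that the backslashes in the regular expression reach JavaScript unchanged.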