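We receive survey webhook data in BigQuery. Comments written in local languages are captured as Unicode escape sequences mixed with special characters. I have written a function that converts the Unicode escapes back to the local language, and I use a regular expression to skip over the special characters.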
#standardSQL
CREATE TEMP FUNCTION DecodeUnicode(s STRING) AS (
(SELECT CODE_POINTS_TO_STRING(ARRAY_AGG(CAST(CONCAT('0x', x) AS INT64)))
FROM UNNEST(SPLIT(s, '\\u')) AS x
WHERE x != ''
)
);
WITH NPSDashboard_Webhook_Data1_copy AS (
SELECT
TRIM(Comment) Comment
FROM
`radiant-micron-790.Sharmila_Testing.NPSDashboard_Webhook_Data1_copy`
)
,
uchars AS (
SELECT DISTINCT
c,
DecodeUnicode(c) uchar
FROM NPSDashboard_Webhook_Data1_copy,
UNNEST(REGEXP_EXTRACT_ALL(Comment, r'(\\u[abcdef0-9]{4})')) c
)
SELECT
Comment,
STRING_AGG(IFNULL(uchar, x), '' ORDER BY pos) Decoded
FROM (
SELECT
Comment,
pos,
SUBSTR(Comment,
SUM(CASE char WHEN '' THEN 1 ELSE 6 END)
OVER(PARTITION BY Comment ORDER BY pos) - CASE char WHEN '' THEN 0 ELSE 5
END,
CASE char WHEN '' THEN 1 ELSE 6 END) x,
uchar
FROM NPSDashboard_Webhook_Data1_copy,
UNNEST(REGEXP_EXTRACT_ALL(Comment, r'(\\u[abcdef0-9]{4})|.')) char WITH OFFSET AS pos
LEFT JOIN uchars u ON u.c = char
)
GROUP BY Comment
It returns an error:
Query failed
Error: Invalid code point 55357
I found that "\ud83c\udf38", the "cherry blossom" emoji, triggers this error. How can I solve this with a regular expression or a converter?
Answer 0: (score: 0)
I don't think you can do this with pure SQL.
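The error comes from how the escapes are encoded: 55357 is 0xD83D, a UTF-16 high surrogate, and CODE_POINTS_TO_STRING accepts only real Unicode code points, so each half of an emoji's surrogate pair is rejected when it is decoded on its own. As a rough illustration in Python (the helper name combine_surrogates is hypothetical, not part of any library), the two halves have to be combined into a single code point first:

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into one Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair from the question, \ud83c\udf38 (cherry blossom):
print(hex(combine_surrogates(0xD83C, 0xDF38)))  # 0x1f338
```

The combined value is a valid code point, whereas 0xD83C or 0xD83D alone is not.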
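I suggest converting the UTF-16 emoji to HTML entities (hexadecimal) before storing them in the database. You will most likely need a programming language to do this:
In .NET, try this: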
using System;
using System.Text;
using System.Globalization;
using System.Net;
public class Program
{
public static void Main()
{
Console.WriteLine(WebUtility.HtmlEncode("\uD83D\uDE02"));
}
}
In Java, the same can be done with a library that provides EmojiUtils:
String line = "Hi , i am fine \uD83D\uDE02 \uD83D\uDE02, how r u ?";
EmojiUtils.hexHtmlify(line); // Hi , i am fine &#x1f602; &#x1f602;, how r u ?
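If the comments can be preprocessed before they are loaded, the same idea can be sketched in Python. This is only a minimal sketch, assuming the comments contain literal \uXXXX escapes as in the question; the function name surrogate_escapes_to_hex_entities and the lowercase entity format are illustrative choices, not taken from any library. It rewrites only surrogate pairs and leaves every other escape untouched, so the existing DecodeUnicode function can still handle the rest:

```python
import re

# A high-surrogate escape (D800-DBFF) immediately followed by a
# low-surrogate escape (DC00-DFFF), written as literal \uXXXX text.
SURROGATE_PAIR = re.compile(
    r"\\u(d[89ab][0-9a-f]{2})\\u(d[c-f][0-9a-f]{2})", re.IGNORECASE
)


def surrogate_escapes_to_hex_entities(text):
    """Replace escaped surrogate pairs with hexadecimal HTML entities."""

    def to_entity(match):
        high = int(match.group(1), 16)
        low = int(match.group(2), 16)
        code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
        return f"&#x{code_point:x};"

    return SURROGATE_PAIR.sub(to_entity, text)


raw = r"nice \ud83c\udf38 \u0928"
print(surrogate_escapes_to_hex_entities(raw))  # nice &#x1f338; \u0928
```

A lone surrogate escape that is not part of a pair is left as-is by this sketch, so it would still need separate handling.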