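URL decoding in BigQuery

Time: 2012-12-12 01:32:32

Tags: google-bigquery

Is there a simple way to do URL decoding within the BigQuery query language? I'm working with a table that has a column in which some values contain URL-encoded strings. For example: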

http://xyz.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz

I extract the "url" parameter like this:

SELECT REGEXP_EXTRACT(column_name, "url=([^&]+)") as url 
from [mydataset.mytable]

which gives me:

http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345

What I would like to do is something like:

SELECT URL_DECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) as url 
from [mydataset.mytable]

so that it returns:

http://www.example.com/hello?v=12345

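If possible, I would like to avoid using multiple REGEXP_REPLACE() statements (replacing %20, %3A, etc.).

Ideas?

4 answers:

Answer 0: (score: 2)

This is a good feature request, but currently there is no built-in BigQuery function that provides URL decoding.

Answer 1: (score: 2)

One more workaround is to use a user-defined function.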

#standardSQL
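CREATE TEMPORARY FUNCTION URL_DECODE(enc STRING)
RETURNS STRING
LANGUAGE js AS """
  // Pass NULL through; decodeURIComponent(null) would return the string "null".
  if (enc === null) return null;
  try {
    // decodeURIComponent also decodes reserved characters such as %3A, %2F
    // and %3F, which decodeURI leaves untouched.
    return decodeURIComponent(enc);
  } catch (e) {
    // Malformed percent-encoding throws URIError; return NULL instead.
    return null;
  }
""";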

SELECT ven_session, 
  URL_DECODE(REGEXP_EXTRACT(para, r'&kw=([^&]*)')) AS q
FROM raas_system.weblog_20170327 
WHERE para like '%&kw=%'
LIMIT 10
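
One detail worth checking in a JS UDF like the one above is which decoding function it calls. In standard JavaScript, decodeURI deliberately leaves escapes of reserved characters (such as %3A, %2F, %3F and %3D) in place, while decodeURIComponent decodes them. The sketch below runs outside BigQuery in plain JavaScript; the helper name urlDecode is made up for illustration.

```javascript
// Hypothetical helper mirroring the body of a JS UDF; not a BigQuery API.
function urlDecode(enc) {
  if (enc === null || enc === undefined) return null;
  try {
    return decodeURIComponent(enc);
  } catch (e) {
    // Malformed percent-encoding raises URIError.
    return null;
  }
}

const sample = 'http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345';

// decodeURI keeps escapes of reserved characters, so nothing changes here.
console.log(decodeURI(sample));
// prints: http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345

console.log(urlDecode(sample));
// prints: http://www.example.com/hello?v=12345

// A truncated escape sequence is reported as null rather than throwing.
console.log(urlDecode('%E0%A4%A'));
// prints: null
```

For a value extracted from a query string, as in the question, decodeURIComponent is probably the behavior you want.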

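Answer 2: (score: 2)

The following builds on @sigpwned's answer, but slightly refactored and wrapped in a SQL UDF (which does not have the limitations of JS UDFs, so it is safe to use).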

  
#standardSQL
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
  SELECT SAFE_CONVERT_BYTES_TO_STRING(
    ARRAY_TO_STRING(ARRAY_AGG(
        IF(STARTS_WITH(y, '%'), FROM_HEX(SUBSTR(y, 2)), CAST(y AS BYTES)) ORDER BY i
      ), b''))
  FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}|[^%]+")) AS y WITH OFFSET AS i 
));
SELECT 
  column_name, 
  URLDECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) AS url
FROM `project.dataset.table`

You can test it with the example from the question, as below:

#standardSQL
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
  SELECT SAFE_CONVERT_BYTES_TO_STRING(
    ARRAY_TO_STRING(ARRAY_AGG(
        IF(STARTS_WITH(y, '%'), FROM_HEX(SUBSTR(y, 2)), CAST(y AS BYTES)) ORDER BY i
      ), b''))
  FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}|[^%]+")) AS y WITH OFFSET AS i 
));
WITH `project.dataset.table` AS (
  SELECT 'http://example.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz' column_name
)
SELECT 
  URLDECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) AS url,
  column_name
FROM `project.dataset.table`    

with the result:

Row url                                     column_name  
1   http://www.example.com/hello?v=12345    http://example.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz   
  

Update, with a further optimized SQL UDF:

CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
  SELECT STRING_AGG(
    IF(REGEXP_CONTAINS(y, r'^%[0-9a-fA-F]{2}'), 
      SAFE_CONVERT_BYTES_TO_STRING(FROM_HEX(REPLACE(y, '%', ''))), y), '' 
    ORDER BY i
    )
  FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}(?:%[0-9a-fA-F]{2})*|[^%]+")) y
  WITH OFFSET AS i 
));
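
To make the tokenizing idea easier to follow, here is a rough JavaScript sketch of the same approach, runnable outside BigQuery. It is an illustration under assumptions, not a translation BigQuery itself provides: the function name urlDecodeByTokens is made up, and TextDecoder stands in for SAFE_CONVERT_BYTES_TO_STRING. The pattern groups consecutive %XX escapes into one token because a multi-byte UTF-8 character such as %e2%80%93 has to be decoded from all of its bytes together.

```javascript
// Hypothetical helper; mirrors the regex used by the SQL UDF above.
function urlDecodeByTokens(url) {
  // Each token is either a run of %XX escapes or a run of literal characters.
  const pattern = /%[0-9a-fA-F]{2}(?:%[0-9a-fA-F]{2})*|[^%]+/g;
  const tokens = url.match(pattern) || [];
  const utf8 = new TextDecoder('utf-8');
  return tokens
    .map((tok) => {
      if (!tok.startsWith('%')) return tok; // literal text passes through
      // "%e2%80%93" -> ["e2", "80", "93"] -> bytes -> UTF-8 string
      const bytes = tok.slice(1).split('%').map((h) => parseInt(h, 16));
      return utf8.decode(Uint8Array.from(bytes));
    })
    .join('');
}

console.log(urlDecodeByTokens('http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345'));
// prints: http://www.example.com/hello?v=12345
console.log(urlDecodeByTokens('gabu%c5%82t%c3%b3w'));
// prints: gabułtów
```

Note that a stray % that is not followed by two hex digits matches neither alternative of the pattern, so it is silently dropped (for example, '100%' becomes '100'). The same applies to the SQL version, since it uses the same pattern.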

Answer 3: (score: 1)

I agree with everyone here that URLDECODE should be a native function. Until that happens, however, it is possible to write a "native" URLDECODE:

SELECT id, SAFE_CONVERT_BYTES_TO_STRING(ARRAY_TO_STRING(ps, b'')) FROM (SELECT
  id,
  ARRAY_AGG(CASE
    WHEN REGEXP_CONTAINS(y, r"^%") THEN FROM_HEX(SUBSTR(y, 2))
    ELSE CAST(y AS bytes)
  END ORDER BY i) AS ps
  FROM (SELECT x AS id, REGEXP_EXTRACT_ALL(x, r"%[0-9a-fA-F]{2}|[^%]+") AS element FROM UNNEST(ARRAY['domodossola%e2%80%93locarno railway', 'gabu%c5%82t%c3%b3w']) AS x) AS x
  CROSS JOIN UNNEST(x.element) AS y WITH OFFSET AS i GROUP BY id);

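In this example, I tried and tested the implementation using a couple of percent-encoded page names from Wikipedia as input. It should work with your input, too.

Obviously, this is extremely unwieldy! For that reason, I'd suggest building a materialized join table, or wrapping this in a view, rather than using the "naked" expression in your query. However, it does appear to get the job done, and it does not hit the UDF limits.

EDIT: @MikhailBerylyant's post has wrapped this cumbersome implementation in a nice, tidy SQL UDF. That is a much better way to handle this!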