Extracting a number or string that follows a given string in BigQuery

Time: 2018-08-22 14:40:08

Tags: google-bigquery

I have a few thousand URLs and want to extract some values from the URL parameters. Here are some examples from the database:

["www.xxx.com?uci=6666&rci=fefw"]
["www.xxx.com?uci=61"]
["www.xxx.com?rci=62&uci=5536"]
["www.xxx.com?uci=6666&utm_source=XXX"]
["www.xxx.com?pccst=TEST%20sTESTg"]
["www.xxx.com?pccst=TEST2%20s&uci=1"]
["www.xxx.com?uci=1pccst=TEST42rt24&rci=2"]

How can I extract the value of the parameter uci? It is always a number (of unknown length). I tried REGEXP_EXTRACT, without success:

REGEXP_EXTRACT(URL, '(uci)\=[0-9]+') AS UCI_extract

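I would also like to extract the value of the parameter pccst. It can contain any characters and I do not know its exact length, but it always ends with ", ? or &.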

I tried REGEXP_EXTRACT for this as well, again without success:

REGEXP_EXTRACT(URL, r'pccst\=(.*)(\"|\&|\?)') AS pccst_extract

I am really not a regex expert, so it would be great if someone could help me. Many thanks in advance, Peter

2 Answers:

Answer 0: (score: 2)

You can adapt this solution:

#standardSQL
# Extract query parameters from a URL as ARRAY in BigQuery; standard-sql; 2018-04-08
# @see http://www.pascallandau.com/bigquery-snippets/extract-url-parameters-array/
WITH examples AS (
  SELECT 1   AS id, 'www.xxx.com?uci=6666&rci=fefw' AS query 
  UNION ALL SELECT 2, 'www.xxx.com?uci=1pccst%20TEST42rt24&rci=2'
  UNION ALL SELECT 3, 'www.xxx.com?pccst=TEST2%20s&uci=1'
)
SELECT 
  id, 
  query,
  REGEXP_EXTRACT_ALL(query,r'(?:\?|&)((?:[^=]+)=(?:[^&]*))') as params,
  REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:([^=]+)=(?:[^&]*))') as keys,
  REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:(?:[^=]+)=([^&]*))') as values
FROM examples
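
To pull one named parameter out of those arrays, something like the following might work. It is only a sketch: the CTE name `parsed` and the aliases `param_keys` and `param_values` are made up for illustration, and it assumes the key and value arrays stay aligned because both patterns match the same `key=value` spans.

```sql
#standardSQL
# Sketch: look up a single parameter in the extracted key/value arrays.
# `parsed`, `param_keys` and `param_values` are hypothetical names.
WITH examples AS (
  SELECT 1 AS id, 'www.xxx.com?uci=6666&rci=fefw' AS query
  UNION ALL SELECT 2, 'www.xxx.com?pccst=TEST2%20s&uci=1'
),
parsed AS (
  SELECT
    id,
    # Same patterns as above: one captures the key, the other the value.
    REGEXP_EXTRACT_ALL(query, r'(?:\?|&)(?:([^=]+)=(?:[^&]*))') AS param_keys,
    REGEXP_EXTRACT_ALL(query, r'(?:\?|&)(?:(?:[^=]+)=([^&]*))') AS param_values
  FROM examples
)
SELECT
  id,
  # Find the position of the key 'uci' and read the value at that position.
  (SELECT param_values[OFFSET(pos)]
   FROM UNNEST(param_keys) AS k WITH OFFSET AS pos
   WHERE k = 'uci'
   LIMIT 1) AS uci
FROM parsed
```

For the two sample rows this should give the strings 6666 and 1. A row without a uci parameter gets NULL, because the scalar subquery returns no rows.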


Answer 1: (score: 1)

Below is an example for BigQuery Standard SQL:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT "www.xxx.com?uci=6666&rci=fefw" url UNION ALL
  SELECT "www.xxx.com?uci=61" UNION ALL
  SELECT "www.xxx.com?rci=62&uci=5536" UNION ALL
  SELECT "www.xxx.com?uci=6666&utm_source=XXX" UNION ALL
  SELECT "www.xxx.com?pccst=TEST%20sTESTg" UNION ALL
  SELECT "www.xxx.com?pccst=TEST2%20s&uci=1" UNION ALL
  SELECT "www.xxx.com?uci=1&pccst=TEST42rt24&rci=2" 
)
SELECT 
  url, 
  REGEXP_EXTRACT(url, r'[?&]uci=(.*?)(?:$|&)') uci,
  REGEXP_EXTRACT(url, r'[?&]pccst=(.*?)(?:$|&)') pccst
FROM `project.dataset.table`   

The result is:

Row url                                         uci     pccst    
1   www.xxx.com?pccst=TEST%20sTESTg             null    TEST%20sTESTg    
2   www.xxx.com?pccst=TEST2%20s&uci=1           1       TEST2%20s    
3   www.xxx.com?uci=1&pccst=TEST42rt24&rci=2    1       TEST42rt24   
4   www.xxx.com?uci=61                          61      null     
5   www.xxx.com?rci=62&uci=5536                 5536    null     
6   www.xxx.com?uci=6666&rci=fefw               6666    null     
7   www.xxx.com?uci=6666&utm_source=XXX         6666    null        
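
Since uci is said to always be a number, a stricter variant could capture digits only and cast the result. This is a minimal sketch, not part of the query above; the column alias `uci_number` is made up. It also suggests why the attempt in the question did not work: when the pattern contains a capturing group, REGEXP_EXTRACT returns what that group matched, and in `(uci)\=[0-9]+` the group wraps the literal `uci` rather than the digits.

```sql
#standardSQL
# Sketch: extract uci as an integer; `uci_number` is a hypothetical alias.
WITH `project.dataset.table` AS (
  SELECT "www.xxx.com?rci=62&uci=5536" url UNION ALL
  SELECT "www.xxx.com?pccst=TEST%20sTESTg"
)
SELECT
  url,
  # [?&] anchors the match to the start of a parameter, so a key that
  # merely ends in "uci" is not picked up; the group captures digits only.
  SAFE_CAST(REGEXP_EXTRACT(url, r'[?&]uci=([0-9]+)') AS INT64) AS uci_number
FROM `project.dataset.table`
```

The first row should yield 5536 and the second NULL, because REGEXP_EXTRACT returns NULL when nothing matches and SAFE_CAST passes NULL through.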

In addition, the option below parses all key-value pairs, so you can dynamically pick whichever ones you need:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT "www.xxx.com?uci=6666&rci=fefw" url UNION ALL
  SELECT "www.xxx.com?uci=61" UNION ALL
  SELECT "www.xxx.com?rci=62&uci=5536" UNION ALL
  SELECT "www.xxx.com?uci=6666&utm_source=XXX" UNION ALL
  SELECT "www.xxx.com?pccst=TEST%20sTESTg" UNION ALL
  SELECT "www.xxx.com?pccst=TEST2%20s&uci=1" UNION ALL
  SELECT "www.xxx.com?uci=1pccst=TEST42rt24&rci=2" 
)
SELECT url, 
  ARRAY(
    SELECT AS STRUCT 
      SPLIT(kv, '=')[SAFE_OFFSET(0)] key, 
      SPLIT(kv, '=')[SAFE_OFFSET(1)] value 
    FROM UNNEST(SPLIT(SUBSTR(url, LENGTH(NET.HOST(url)) + 2), '&')) kv
  ) key_value_pair
FROM `project.dataset.table`
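
One possible way to pick individual parameters from `key_value_pair` is a scalar subquery over the array. The sketch below reuses the parsing from the query above unchanged, so it relies on NET.HOST handling these scheme-less URLs the same way; the CTE name `parsed` is made up for illustration.

```sql
#standardSQL
# Sketch: select named parameters from the parsed key/value structs.
# `parsed` is a hypothetical CTE name.
WITH `project.dataset.table` AS (
  SELECT "www.xxx.com?rci=62&uci=5536" url UNION ALL
  SELECT "www.xxx.com?pccst=TEST2%20s&uci=1"
),
parsed AS (
  SELECT url,
    ARRAY(
      SELECT AS STRUCT
        SPLIT(kv, '=')[SAFE_OFFSET(0)] key,
        SPLIT(kv, '=')[SAFE_OFFSET(1)] value
      FROM UNNEST(SPLIT(SUBSTR(url, LENGTH(NET.HOST(url)) + 2), '&')) kv
    ) key_value_pair
  FROM `project.dataset.table`
)
SELECT
  url,
  # Each subquery scans the array of structs for one key.
  (SELECT value FROM UNNEST(key_value_pair) WHERE key = 'uci' LIMIT 1) AS uci,
  (SELECT value FROM UNNEST(key_value_pair) WHERE key = 'pccst' LIMIT 1) AS pccst
FROM parsed
```

A parameter that is absent from a URL comes back as NULL. If a key appears more than once, LIMIT 1 keeps an arbitrary one of its values.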