I have a few thousand URLs and want to extract some values from the URL parameters. Here are some examples from the database:
["www.xxx.com?uci=6666&rci=fefw"]
["www.xxx.com?uci=61"]
["www.xxx.com?rci=62&uci=5536"]
["www.xxx.com?uci=6666&utm_source=XXX"]
["www.xxx.com?pccst=TEST%20sTESTg"]
["www.xxx.com?pccst=TEST2%20s&uci=1"]
["www.xxx.com?uci=1&pccst=TEST42rt24&rci=2"]
How can I extract the value of the parameter uci? It is always a number (of unknown length). I tried REGEXP_EXTRACT, but without success:
REGEXP_EXTRACT(URL, '(uci)\=[0-9]+') AS UCI_extract
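I also want to extract the value of the parameter pccst. It can contain any character and I don't know its exact length, but it always ends with ", ? or &.
I tried REGEXP_EXTRACT for this as well, again without success: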
REGEXP_EXTRACT(URL, r'pccst\=(.*)(\"|\&|\?)') AS pccst_extract
I'm really not a regex expert. It would be great if someone could help. Thanks in advance, Peter
Answer 0: (score: 2)
You can adapt this solution:
#standardSQL
# Extract query parameters from a URL as ARRAY in BigQuery; standard-sql; 2018-04-08
# @see http://www.pascallandau.com/bigquery-snippets/extract-url-parameters-array/
WITH examples AS (
  SELECT 1 AS id, 'www.xxx.com?uci=6666&rci=fefw' AS query
  UNION ALL SELECT 2, 'www.xxx.com?uci=1pccst%20TEST42rt24&rci=2'
  UNION ALL SELECT 3, 'www.xxx.com?pccst=TEST2%20s&uci=1'
)
SELECT
  id,
  query,
  REGEXP_EXTRACT_ALL(query,r'(?:\?|&)((?:[^=]+)=(?:[^&]*))') as params,
  REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:([^=]+)=(?:[^&]*))') as keys,
  REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:(?:[^=]+)=([^&]*))') as values
FROM examples
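The keys and values arrays above line up by position, so one way to get a single parameter out of them is to look up the key's offset and read the value at the same offset. The following is only a sketch of that idea: the CTE name parsed and the aliases param_keys, param_values, k and pos are made up for illustration, and the two sample URLs are taken from the question.

```sql
#standardSQL
-- Sketch: pick one parameter by pairing the keys/values arrays by position.
-- "parsed", "param_keys", "param_values", "k" and "pos" are illustrative names.
WITH examples AS (
  SELECT 1 AS id, 'www.xxx.com?uci=6666&rci=fefw' AS query
  UNION ALL SELECT 2, 'www.xxx.com?pccst=TEST2%20s&uci=1'
),
parsed AS (
  SELECT
    id,
    REGEXP_EXTRACT_ALL(query, r'(?:\?|&)(?:([^=]+)=(?:[^&]*))') AS param_keys,
    REGEXP_EXTRACT_ALL(query, r'(?:\?|&)(?:(?:[^=]+)=([^&]*))') AS param_values
  FROM examples
)
SELECT
  id,
  -- Find the position of the key 'uci' and read the value at that position.
  -- SAFE_OFFSET yields NULL instead of an error if the position is missing.
  (SELECT param_values[SAFE_OFFSET(pos)]
   FROM UNNEST(param_keys) AS k WITH OFFSET AS pos
   WHERE k = 'uci'
   LIMIT 1) AS uci
FROM parsed
```

With these two rows the uci column should come out as 6666 and 1. If you only ever need one or two known parameters, a direct REGEXP_EXTRACT per parameter is probably simpler than building the arrays first.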
Answer 1: (score: 1)
Below is an example for BigQuery Standard SQL:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT "www.xxx.com?uci=6666&rci=fefw" url UNION ALL
  SELECT "www.xxx.com?uci=61" UNION ALL
  SELECT "www.xxx.com?rci=62&uci=5536" UNION ALL
  SELECT "www.xxx.com?uci=6666&utm_source=XXX" UNION ALL
  SELECT "www.xxx.com?pccst=TEST%20sTESTg" UNION ALL
  SELECT "www.xxx.com?pccst=TEST2%20s&uci=1" UNION ALL
  SELECT "www.xxx.com?uci=1&pccst=TEST42rt24&rci=2"
)
SELECT
  url,
  REGEXP_EXTRACT(url, r'[?&]uci=(.*?)(?:$|&)') uci,
  REGEXP_EXTRACT(url, r'[?&]pccst=(.*?)(?:$|&)') pccst
FROM `project.dataset.table`
The result is:
Row  url                                        uci   pccst
1    www.xxx.com?pccst=TEST%20sTESTg            null  TEST%20sTESTg
2    www.xxx.com?pccst=TEST2%20s&uci=1          1     TEST2%20s
3    www.xxx.com?uci=1&pccst=TEST42rt24&rci=2   1     TEST42rt24
4    www.xxx.com?uci=61                         61    null
5    www.xxx.com?rci=62&uci=5536                5536  null
6    www.xxx.com?uci=6666&rci=fefw              6666  null
7    www.xxx.com?uci=6666&utm_source=XXX        6666  null
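As for why the attempts in the question did not work: when the pattern contains a capturing group, REGEXP_EXTRACT returns the text matched by that group, so '(uci)\=[0-9]+' gives back the literal string uci rather than the digits. The pccst pattern has two capturing groups, which BigQuery rejects, and it also requires a trailing delimiter, so a value at the very end of the URL would not match. A possible variant that keeps the question's "always a number" constraint and casts the result is sketched below; the three sample rows are taken from the question and the choice of INT64 is an assumption.

```sql
#standardSQL
-- Sketch: exactly one capturing group per pattern, placed around the value.
WITH `project.dataset.table` AS (
  SELECT "www.xxx.com?rci=62&uci=5536" url UNION ALL
  SELECT "www.xxx.com?pccst=TEST2%20s&uci=1" UNION ALL
  SELECT "www.xxx.com?pccst=TEST%20sTESTg"
)
SELECT
  url,
  -- Digits only; SAFE_CAST returns NULL instead of failing if the
  -- parameter is absent or the number does not fit into INT64.
  SAFE_CAST(REGEXP_EXTRACT(url, r'[?&]uci=([0-9]+)') AS INT64) uci,
  -- Everything up to the next & (or the end of the string).
  REGEXP_EXTRACT(url, r'[?&]pccst=([^&]*)') pccst
FROM `project.dataset.table`
```

The leading [?&] matters: without it, a pattern such as uci= could also match inside a longer parameter name that merely ends in uci. Note that pccst is returned still percent-encoded (for example TEST2%20s).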
In addition, the option below parses all key-value pairs, so you can dynamically pick whichever ones you need:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT "www.xxx.com?uci=6666&rci=fefw" url UNION ALL
  SELECT "www.xxx.com?uci=61" UNION ALL
  SELECT "www.xxx.com?rci=62&uci=5536" UNION ALL
  SELECT "www.xxx.com?uci=6666&utm_source=XXX" UNION ALL
  SELECT "www.xxx.com?pccst=TEST%20sTESTg" UNION ALL
  SELECT "www.xxx.com?pccst=TEST2%20s&uci=1" UNION ALL
  SELECT "www.xxx.com?uci=1&pccst=TEST42rt24&rci=2"
)
SELECT url,
  ARRAY(
    SELECT AS STRUCT
      SPLIT(kv, '=')[SAFE_OFFSET(0)] key,
      SPLIT(kv, '=')[SAFE_OFFSET(1)] value
    FROM UNNEST(SPLIT(SUBSTR(url, LENGTH(NET.HOST(url)) + 2), '&')) kv
  ) key_value_pair
FROM `project.dataset.table`
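To actually select a parameter from key_value_pair, one option is a small scalar subquery over the array. This is a sketch under the same assumptions as the query above (URLs without a scheme or path, so that everything after the host and the ? is the query string); the CTE name parsed is made up for illustration and only two sample rows are shown.

```sql
#standardSQL
-- Sketch: build the key/value array as above, then look up keys by name.
-- "parsed" is an illustrative CTE name.
WITH `project.dataset.table` AS (
  SELECT "www.xxx.com?uci=6666&rci=fefw" url UNION ALL
  SELECT "www.xxx.com?pccst=TEST2%20s&uci=1"
),
parsed AS (
  SELECT url,
    ARRAY(
      SELECT AS STRUCT
        SPLIT(kv, '=')[SAFE_OFFSET(0)] key,
        SPLIT(kv, '=')[SAFE_OFFSET(1)] value
      FROM UNNEST(SPLIT(SUBSTR(url, LENGTH(NET.HOST(url)) + 2), '&')) kv
    ) key_value_pair
  FROM `project.dataset.table`
)
SELECT
  url,
  -- LIMIT 1 keeps the subquery scalar even if a key is repeated in the URL.
  (SELECT value FROM UNNEST(key_value_pair) WHERE key = 'uci' LIMIT 1) uci,
  (SELECT value FROM UNNEST(key_value_pair) WHERE key = 'pccst' LIMIT 1) pccst
FROM parsed
```

A parameter that is missing from a URL simply comes back as NULL, the same as with the REGEXP_EXTRACT version.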