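I need help parsing URLs with BigQuery. I need to remove the string/text after the last forward slash "/" and return the URL. The length of the input URL can vary from record to record. If the input URL contains no string/text after the domain, it should be returned as is.
Here are some examples.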
Input URL
https://www.stackoverflow.com/questions
Expected output
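I tried using the SPLIT function to convert the URL string into an ARRAY and ARRAY_LENGTH to compute the array size. However, that does not cover all of the cases I mentioned above.
Please advise how to solve this in BigQuery using standard SQL.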
Answer 0: (score: 4)
I think a case expression helps cover the gaps:
select (case when url like '%//%/%' then regexp_replace(url, '/[^/]+$', '')
             else url
        end)
from (select 'https://www.stackoverflow.com/questions/ask' as url union all
      select 'https://www.stackoverflow.com/questions' as url union all
      select 'https://www.stackoverflow.com' as url
     ) x;
Answer 1: (score: 2)
Below is for BigQuery Standard SQL:
#standardSQL
SELECT url,
REPLACE(REGEXP_REPLACE(REPLACE(url, '//', '\\'), r'/[^/]+$', ''), '\\', '//')
FROM `project.dataset.table`
You can test and play with the above using the sample data from the question:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'https://www.stackoverflow.com' url UNION ALL
SELECT 'https://www.stackoverflow.com/questions' UNION ALL
SELECT 'https://www.stackoverflow.com/questions/ask' UNION ALL
SELECT 'https://stackoverflow.com/questions/ask/some-text'
)
SELECT url,
REPLACE(REGEXP_REPLACE(REPLACE(url, '//', '\\'), r'/[^/]+$', ''), '\\', '//') value
FROM `project.dataset.table`
with the result:
Row  url                                                value
1    https://www.stackoverflow.com                      https://www.stackoverflow.com
2    https://www.stackoverflow.com/questions            https://www.stackoverflow.com
3    https://www.stackoverflow.com/questions/ask        https://www.stackoverflow.com/questions
4    https://stackoverflow.com/questions/ask/some-text  https://stackoverflow.com/questions/ask
Answer 2: (score: 2)
You can use a simple REGEXP_REPLACE on the last "/" and the string that follows it.
SELECT REGEXP_REPLACE(url, r"([^/])/[^/]*$", "\\1")
FROM (SELECT 'https://www.stackoverflow.com/questions/ask' as url UNION ALL
SELECT 'https://www.stackoverflow.com/questions' as url UNION ALL
SELECT 'https://www.stackoverflow.com' as url
)
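Note: \\1 (the first capture group) stands for the character before the "/"; capturing it is what keeps the pattern from matching the "//" after the scheme.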
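To see what the capture group buys, here is a minimal sketch of the same pattern in plain JavaScript rather than BigQuery. The helper names stripWithGuard and stripWithoutGuard are made up for this illustration, and JavaScript writes the backreference as $1 where the BigQuery string literal uses "\\1".

```javascript
// Illustration only: plain JavaScript, not BigQuery SQL.
// Both helper names are hypothetical and do not appear in the answer.

// Same idea as the answer's pattern: ([^/]) demands a non-slash character
// right before the final "/", and $1 puts that character back.
function stripWithGuard(url) {
  return url.replace(/([^/])\/[^/]*$/, '$1');
}

// The pattern without the guard, for comparison.
function stripWithoutGuard(url) {
  return url.replace(/\/[^/]*$/, '');
}

var samples = [
  'https://www.stackoverflow.com/questions/ask',
  'https://www.stackoverflow.com/questions',
  'https://www.stackoverflow.com'
];

samples.forEach(function (url) {
  console.log(url + ' -> ' + stripWithGuard(url));
});

// Without the guard, a bare domain loses everything after "https:/".
console.log(stripWithoutGuard('https://www.stackoverflow.com')); // https:/
```

With the guard in place the bare domain comes back unchanged, because the only slashes in it belong to the "//" and neither of them can satisfy the pattern through to the end of the string.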
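Answer 3: (score: 0)
Offering a JavaScript UDF solution. Not because it is better for this case, but because it is always your last resort when things get really complicated.
(Also, I would like to point out that double slashes can occur inside a URL, for example https://www.stackoverflow.com//questions//ask; handling that may require extra logic coded in JavaScript.)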
#standardSQL
CREATE TEMP FUNCTION
remove_last_part_from_url(url STRING)
RETURNS STRING
LANGUAGE js AS """
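  // Position of the last "/" and of the first "//" (the one after the scheme).
  var last_slash = url.lastIndexOf('/');
  var first_double_slash = url.indexOf('//');
  // Cut only when the last "/" is not the second slash of that first "//".
  if (first_double_slash != -1
      && last_slash != -1
      && last_slash != first_double_slash + 1) {
    return url.substring(0, last_slash);
  }
  return url;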
""" ;
SELECT remove_last_part_from_url(url)
FROM (SELECT 'https://www.stackoverflow.com/questions/ask' as url UNION ALL
SELECT 'https://www.stackoverflow.com/questions' as url UNION ALL
SELECT 'https://www.stackoverflow.com//questions' as url UNION ALL -- double slash after https://
SELECT 'https:/invalid_url' as url UNION ALL
SELECT 'https://www.stackoverflow.com' as url
)
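The extra logic for repeated slashes inside the path is not spelled out in the answer. The following is one possible sketch in plain JavaScript, under the assumption (mine, not the answer's) that a run of slashes in front of the last segment should be removed together with that segment. The function name removeLastSegment is hypothetical; its body could be adapted into the UDF above.

```javascript
// Hypothetical helper, not part of the original answer.
// Assumption: a run of slashes before the last path segment counts as
// one separator and is removed along with the segment.
function removeLastSegment(url) {
  var schemeEnd = url.indexOf('//');
  if (schemeEnd === -1) {
    return url; // no "scheme://" part, e.g. "https:/invalid_url"
  }
  var hostStart = schemeEnd + 2;
  var rest = url.substring(hostStart); // host followed by the path
  var firstPathSlash = rest.indexOf('/');
  if (firstPathSlash === -1) {
    return url; // nothing after the domain
  }
  var host = rest.substring(0, firstPathSlash);
  var path = rest.substring(firstPathSlash);
  // Drop the last segment and the slash run directly in front of it.
  var trimmedPath = path.replace(/\/+[^/]*$/, '');
  return url.substring(0, hostStart) + host + trimmedPath;
}

[
  'https://www.stackoverflow.com/questions/ask',
  'https://www.stackoverflow.com//questions//ask',
  'https://www.stackoverflow.com//questions',
  'https:/invalid_url',
  'https://www.stackoverflow.com'
].forEach(function (url) {
  console.log(url + ' -> ' + removeLastSegment(url));
});
```

Splitting off the host first means the regular expression only ever sees the path, so the "//" after the scheme never needs a special case.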