我需要使用PostgreSQL提取网址列表的域名。在第一个版本中,我尝试使用REGEXP_REPLACE替换不需要的字符,例如www。,biz。,sports等,以获得域名。
SELECT REGEXP_REPLACE(url, ^((www|www2|www3|static1|biz|health|travel|property|edu|world|newmedia|digital|ent|staging|cpelection|dev|m-staging|m|maa|cdnnews|testing|cdnpuc|shipping|sports|life|static01|cdn|dev1|ad|backends|avm|displayvideo|tand|static03|subscriptionv3|mdev|beta)\.)?', '') AS "Domain",
COUNT(DISTINCT(user)) AS "Unique Users"
FROM db
GROUP BY 1
ORDER BY 2 DESC;
这似乎是不利的,因为需要不断更新查询以查找不需要的单词列表。
我确实尝试过https://stackoverflow.com/a/21174423/10174021使用PostgreSQL REGEXP_SUBSTR从行尾提取,但是作为回报,我得到了空白行。有更好的方法吗?
尝试使用的数据集示例:
CREATE TABLE sample (
url VARCHAR(100) NOT NULL);
INSERT INTO sample url)
VALUES
("sample.co.uk"),
("www.sample.co.uk"),
("www3.sample.co.uk"),
("biz.sample.co.uk"),
("digital.testing.sam.co"),
("sam.co"),
("m.sam.co");
所需的输出
+------------------------+--------------+
| url | domain |
+------------------------+--------------+
| sample.co.uk | sample.co.uk |
| www.sample.co.uk | sample.co.uk |
| www3.sample.co.uk | sample.co.uk |
| biz.sample.co.uk | sample.co.uk |
| digital.testing.sam.co | sam.co |
| sam.co | sam.co |
| m.sam.co | sam.co |
+------------------------+--------------+
答案 0 :(得分:1)
您可以尝试以下方法:
with tlds as (
select * from (values('.co.uk'),('.co'),('.uk')) a(tld)
) ,
sample as (
select * from (values ('sample.co.uk'),
('www.sample.co.uk'),
('www3.sample.co.uk'),
('biz.sample.co.uk'),
('digital.testing.sam.co'),
('sam.co'),
('m.sam.co')
) a(url)
)
select url,regexp_replace(url,'(.*\.)(.*'||replace(tld,'.','\.')||')','\2') "domain" from (
select distinct url,first_value(tld) over (PARTITION BY url order by length(tld) DESC) tld
from sample join tlds on (url like '%'||tld)
) a
答案 1 :(得分:0)
因此,我已经使用Jeremy和RémyBaron的答案找到了解决方案。
从public suffix中提取所有公共后缀并存储到 我标记为tlds的表格。
regexp_substr(t1.url, '([a-z]+)(.)'||t1."tld")
提取域名。最终输出: SQL查询如下:
WITH stored_tld AS(
SELECT
DISTINCT(s.url),
FIRST_VALUE(t.domain) over (PARTITION BY s.url ORDER BY length(t.domain) DESC
rows between unbounded preceding and unbounded following) AS "tld"
FROM sample s
JOIN tlds t
ON (s.url like '%%'||domain))
SELECT
t1.url,
CASE WHEN t1."tld" IS NULL THEN t1.url ELSE regexp_replace(t1.url,'(.*\.)((.[a-z]*).*'||replace(t1."tld",'.','\.')||')','\2')
END AS "extracted_domain"
FROM(
SELECT a.url,st."tld"
FROM sample a
LEFT JOIN stored_tld st
ON a.url = st.url
)t1
尝试链接:SQL Tester