使用PostgreSQL从URL提取域

时间:2019-05-07 09:41:26

标签: regex postgresql url

我需要使用PostgreSQL提取网址列表的域名。在第一个版本中,我尝试使用REGEXP_REPLACE替换不需要的字符,例如www。,biz。,sports等,以获得域名。

 SELECT REGEXP_REPLACE(url, ^((www|www2|www3|static1|biz|health|travel|property|edu|world|newmedia|digital|ent|staging|cpelection|dev|m-staging|m|maa|cdnnews|testing|cdnpuc|shipping|sports|life|static01|cdn|dev1|ad|backends|avm|displayvideo|tand|static03|subscriptionv3|mdev|beta)\.)?', '') AS "Domain", 
 COUNT(DISTINCT(user)) AS "Unique Users"
 FROM db
 GROUP BY 1
 ORDER BY 2 DESC;

这似乎是不利的,因为需要不断更新查询以查找不需要的单词列表。

我确实尝试过https://stackoverflow.com/a/21174423/10174021使用PostgreSQL REGEXP_SUBSTR从行尾提取,但是作为回报,我得到了空白行。有更好的方法吗?

尝试使用的数据集示例:

 CREATE TABLE sample (
 url VARCHAR(100) NOT NULL);

 INSERT INTO sample url) 
 VALUES 
 ("sample.co.uk"),
 ("www.sample.co.uk"),
 ("www3.sample.co.uk"),
 ("biz.sample.co.uk"),
 ("digital.testing.sam.co"),
 ("sam.co"),
 ("m.sam.co");

所需的输出

+------------------------+--------------+
|    url                 |  domain      |
+------------------------+--------------+
| sample.co.uk           | sample.co.uk |
| www.sample.co.uk       | sample.co.uk |
| www3.sample.co.uk      | sample.co.uk |
| biz.sample.co.uk       | sample.co.uk |
| digital.testing.sam.co | sam.co       |
| sam.co                 | sam.co       |
| m.sam.co               | sam.co       |
+------------------------+--------------+

2 个答案:

答案 0 :(得分:1)

您可以尝试以下方法:

with tlds as (
     select * from (values('.co.uk'),('.co'),('.uk')) a(tld)
) ,
sample as (
    select * from (values ('sample.co.uk'),
                          ('www.sample.co.uk'),
                          ('www3.sample.co.uk'),
                          ('biz.sample.co.uk'),
                          ('digital.testing.sam.co'),
                          ('sam.co'),
                          ('m.sam.co')
                   ) a(url)
     ) 
  select url,regexp_replace(url,'(.*\.)(.*'||replace(tld,'.','\.')||')','\2') "domain" from (
            select distinct url,first_value(tld) over (PARTITION BY url order by length(tld) DESC) tld 
               from sample join tlds on (url like '%'||tld) 
         ) a

答案 1 :(得分:0)

因此,我已经使用Jeremy和RémyBaron的答案找到了解决方案。

  1. public suffix中提取所有公共后缀并存储到 我标记为tlds的表格。

  2. 获取数据集中的唯一URL并匹配其TLD。 part1

  3. 使用regexp_replace(此查询中使用)或备用regexp_substr(t1.url, '([a-z]+)(.)'||t1."tld")提取域名。最终输出:final_output

SQL查询如下:

WITH stored_tld AS(
SELECT 
DISTINCT(s.url),
FIRST_VALUE(t.domain) over (PARTITION BY s.url ORDER BY length(t.domain) DESC
                            rows between unbounded preceding and unbounded following) AS "tld" 
FROM sample s 
JOIN tlds t 
ON (s.url like '%%'||domain))

SELECT 
t1.url,
CASE WHEN t1."tld" IS NULL THEN t1.url ELSE regexp_replace(t1.url,'(.*\.)((.[a-z]*).*'||replace(t1."tld",'.','\.')||')','\2') 
END AS "extracted_domain" 
FROM(
    SELECT a.url,st."tld"
    FROM sample a
    LEFT JOIN stored_tld st
    ON a.url = st.url
    )t1

尝试链接:SQL Tester