Question

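I want to take a fairly modest set of URLs and parse them down to their top-level domains using one PostgreSQL query (or several, if needed).

The main steps for doing this seem to be as follows: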

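1. Take in the URL.
2. If the number of '/' is > 3, remove everything from the third '/' onward (including that '/').
3. Count the number of '.' characters in the URL that results from steps 1-2.
4. If the number of '.' is 1, simply remove everything before the '://'.
5. If the number of '.' is > 1, find the last '.' and extract the text between it and the end of the new string.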
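
The steps listed above can be approximated with plain string functions instead of a regular expression. What follows is only a minimal sketch under stated assumptions: the sample URL and the names `sample`, `hosts`, `host`, `dot_count` and `tld` are illustrative, the final step is interpreted as "take the label after the last dot", and the query is written for PostgreSQL. Whether each function is available on Redshift would need to be checked there.

```sql
-- Illustrative only: the sample URL and all aliases are assumptions.
SELECT url,
       host,
       -- number of '.' characters in the host part
       length(host) - length(replace(host, '.', '')) AS dot_count,
       -- label after the last '.': reverse, take the first piece, reverse back
       reverse(split_part(reverse(host), '.', 1)) AS tld
FROM (
    -- the third '/'-separated piece is whatever sits between '://' and the next '/'
    SELECT url, split_part(url, '/', 3) AS host
    FROM (
        SELECT 'http://www.postgresql.org/message-id/abc'::text AS url
    ) AS sample
) AS hosts;
```

For the sample URL, `host` is `www.postgresql.org`, `dot_count` is 2 and `tld` is `org`. The sketch does not strip a port or a `user:password@` prefix, so URLs containing either would need extra handling.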

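I was able to find a couple of examples: (a) http://www.postgresql.org/message-id/247444.36947.qm@web50311.mail.re2.yahoo.com (b) http://www.seanbehan.com/extract-domain-names-from-links-in-text-with-postgres-and-a-single-sql-query

But neither of these seems to work properly. I am querying a Redshift database, and when I try to run them I get a 'Function not Implemented' error.

Although there are plenty of ways to do this in Python or other languages, I have not been able to find a solution on SO written specifically for PostgreSQL.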

Answer 1

Assuming your URLs have a scheme, have you tried something like this:

select substring( 'http://www.arandomsite.com' from '^[^:]*://(?:[^/:]*:[^/@]*@)?(?:[^/:.]*\.)+([^:/]+)' ) as tld;

Details:

^        # anchor for the start of the string
[^:]*:// # the scheme
(?:[^/:]*:[^/@]*@)? # optional "user:password@"
(?:[^/:.]*\.)+ # other parts of the hostname
([^:/]+) # tld (note that ":" is excluded too, to avoid matching the port)

Note: if the URL has an IPv4 or IPv6 address as its hostname, this obviously will not work.
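
To see how the pattern behaves on more than one input, it can be run over a small inline list. This is a sketch for PostgreSQL only: the URLs and the alias `sample(url)` are made up for illustration, and it assumes the default `standard_conforming_strings = on` so that `\.` reaches the regex engine unchanged.

```sql
-- Illustrative inputs; none of these URLs come from the question.
SELECT url,
       substring(url from '^[^:]*://(?:[^/:]*:[^/@]*@)?(?:[^/:.]*\.)+([^:/]+)') AS tld
FROM (VALUES
        ('http://www.arandomsite.com'),                     -- tld: com
        ('https://user:secret@sub.example.org:8080/path'),  -- tld: org (credentials and port skipped)
        ('http://192.168.0.1/')                             -- tld: 1 (the IPv4 limitation noted above)
     ) AS sample(url);
```

The third row illustrates the caveat about IP addresses: the pattern still matches, but the captured "tld" is just the last octet.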

Answer 2

Not quite, but this is certainly powerful and fast:

select translate(split_part('https://developer.twitter.com/en/portal/projects/123/apps', '/', 3), '.', ' ');
>  developer twitter com

The result feeds nicely into a tsvector.
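
As a rough illustration of that last remark, the space-separated host can be passed to PostgreSQL's `to_tsvector`. The choice of the `'simple'` text search configuration here is an assumption (it avoids stemming and stop-word removal), and this function belongs to PostgreSQL full-text search, so it should not be assumed to exist on Redshift.

```sql
-- Sketch: turn the host labels into a tsvector without stemming.
SELECT to_tsvector(
         'simple',
         translate(
           split_part('https://developer.twitter.com/en/portal/projects/123/apps', '/', 3),
           '.', ' '
         )
       ) AS host_terms;
```

Each label of the host (`developer`, `twitter`, `com`) becomes a separate lexeme with its position, which makes the hosts searchable by any of their parts.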
