Question

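I am having trouble parsing URLs in Postgres. I have a database full of customers and the URLs associated with them. I need an array of the unique domains associated with each customer. I would like to do the parsing in my query rather than dumping my results to Python and parsing them there.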

I found this in the Postgres documentation, but I cannot figure out how to incorporate it into my query:

SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');  

  alias   |  description  |            token               
----------+---------------+------------------------------  
 protocol | Protocol head | http://  
 url      | URL           | example.com/stuff/index.html  
 host     | Host          | example.com  
 url_path | URL path      | /stuff/index.html

(http://www.postgresql.org/docs/9.3/static/textsearch-parsers.html)

I am starting with a table like this:

customer_id | url 
-------------+--------------------   
000001      | www.example.com/fish  
000001      | www.example.com/potato  
000001      | www.potato.com/artichoke
000002      | www.otherexample.com

My code so far:

SELECT customer_id, array_agg(url)
FROM customer_url_table
GROUP BY customer_id

This gives me:

customer_id | unique_domains
-----------------------------
000001      | {www.example.com/fish, www.example.com/potato, www.potato.com/artichoke}
000002      | {www.otherexample.com}

I want a table like this:

customer_id | unique_domains
-----------------------------
000001      | {example.com, potato.com}
000002      | {otherexample.com}

I am using a PostgreSQL 9.3.3 database hosted on AWS.

Answer 1

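The documentation you linked above is for the Postgres text search parser. That requires separate configuration to set up, and it may be more overhead than you are looking for, or a different kind of tool altogether.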

If you do want to go that route and set up a text search configuration, you can find more information here:

http://www.postgresql.org/docs/9.3/static/sql-createtsconfig.html

However, if you want to do the parsing inline in Postgres, I would recommend using a Postgres procedural language, in which you can import parsing libraries.

You mentioned Python, so you could use PL/Python with a URL parsing library such as urlparse (called urllib.parse in Python 3).

More info about urlparse

It includes this example code:

>>> from urlparse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o   
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
            params='', query='', fragment='')
>>> o.scheme
'http'
>>> o.port
80
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'

Going beyond that example, you can get the host name from the hostname member:

>>> print o.hostname
www.cwi.nl
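One thing to watch for with the URLs in the question: they have no scheme (for example www.example.com/fish), and urlparse treats such a string as a path, so hostname comes back as None. A minimal sketch of one way around this, assuming Python 3 and a hypothetical helper named hostname_of that prepends "//" when no scheme is present:

```python
from urllib.parse import urlparse

def hostname_of(url):
    """Return the host part of a URL, tolerating a missing scheme."""
    # Without a scheme or a leading "//", urlparse puts everything in .path,
    # so we add "//" to make it parse the first component as the network location.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    return urlparse(url).hostname

print(hostname_of("http://www.cwi.nl:80/%7Eguido/Python.html"))  # www.cwi.nl
print(hostname_of("www.example.com/fish"))                       # www.example.com
print(urlparse("www.example.com/fish").hostname)                 # None
```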

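If you want to properly parse out just the domain name (there are a lot of edge cases and variants, e.g. dropping the www and any other assorted parts that may be present), an approach such as the one in this answer would be best.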
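For the simple data shown in the question, a much cruder approach may be enough. The following is only a sketch, not the approach from the linked answer: naive_domain is a hypothetical helper that takes the host name and strips a leading "www." and nothing else. It does not handle other subdomains or multi-part suffixes such as co.uk.

```python
from urllib.parse import urlparse

def naive_domain(url):
    """Return the host name with a leading 'www.' removed (naive on purpose)."""
    # Make scheme-less input such as "www.example.com/fish" parseable.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    # .hostname is already lower-cased; fall back to "" if nothing was parsed.
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host

print(naive_domain("www.potato.com/artichoke"))    # potato.com
print(naive_domain("http://WWW.Example.com/fish")) # example.com
```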

More information about setting up PL/Python is available here:

http://www.postgresql.org/docs/9.3/static/plpython.html

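So, you can do the parsing in Postgres

"instead of dumping my results to Python and parsing it there"

It ends up coming full circle with PL/Python, but if you really want to do the parsing from within SQL (especially for performance reasons, say, across a large data set), going with PL/Python may be worth the extra effort.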
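To make this concrete, here is a rough sketch of the Python logic such a function might contain, run outside the database against the sample rows from the question. The name url_domain, the plpython3u wrapper shown in the comments, and the naive "strip www." rule are all assumptions for illustration. PL/Python is an untrusted language, so it is worth checking whether a hosted Postgres service allows it.

```python
from urllib.parse import urlparse

def url_domain(url):
    # Inside Postgres this would be the body of a PL/Python function, roughly:
    #   CREATE FUNCTION url_domain(url text) RETURNS text
    #   AS $$ ...this body... $$ LANGUAGE plpython3u;
    # and the query could then become something like:
    #   SELECT customer_id, array_agg(DISTINCT url_domain(url))
    #   FROM customer_url_table GROUP BY customer_id;
    if "://" not in url and not url.startswith("//"):
        url = "//" + url  # let urlparse see a network location
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

# Sample rows from the question, used here to mimic the GROUP BY in plain Python.
rows = [
    ("000001", "www.example.com/fish"),
    ("000001", "www.example.com/potato"),
    ("000001", "www.potato.com/artichoke"),
    ("000002", "www.otherexample.com"),
]

grouped = {}
for customer_id, url in rows:
    grouped.setdefault(customer_id, set()).add(url_domain(url))

unique_domains = {cid: sorted(domains) for cid, domains in grouped.items()}
for cid in sorted(unique_domains):
    print(cid, unique_domains[cid])
# 000001 ['example.com', 'potato.com']
# 000002 ['otherexample.com']
```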

Parsing URLs in Postgres
