Question

I have a column of about 10k URIs in my SQLite database. I want to determine which of those URIs are subdomains of the same website.

For example, given the set...

 1. daiquiri.rum.cu
 2. mojito.rum.cu
 3. cubalibre.rum.cu
 4. americano.campari.it
 5. negroni.campari.it
 6. hemingway.com

...I want to run a query that returns:

Website       | Occurrences
----------------------------
rum.cu        |     3
campari.it    |     2
hemingway.com |     1

That is, the matched domains/patterns, ranked by the number of times they are found in the database.

The heuristic I would use is: for each URI with three or more domain labels, replace the first label with '%' and run the pseudo-query: COUNT(uris from website LIKE '%.remainderofmyuri').

Note that I am not concerned about execution speed (in fact, not at all). The number of entries is in the 10k-100k range.

Answer 1

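The only problem is finding the domain. To find an algorithm, imagine your URLs with an additional dot in front (as in '.negroni.campari.it' and '.hemingway.com'). You see, the domain is always the string after the second dot from the right. All we would have to do is look for that occurrence and remove the part of the string before it. Unfortunately, however, SQLite's string functions are rather poor. There is no function that gives you the second occurrence of a dot, not even when counting from the left. So the algorithm is fine for most DBMSs, but not for SQLite. We need another approach. (I am writing this down anyway to show how one would usually approach such a problem.)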
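For comparison, when the database is driven from a host language, the "second dot from the right" rule can be handed to SQLite as a user-defined function. The following is only a minimal sketch, assuming access through Python's sqlite3 module; the function name base_domain and the table uris(uri) with the question's sample rows are illustrative assumptions, not part of the answer.

```python
import sqlite3

def base_domain(uri):
    """Return the last two dot-separated labels of uri, e.g. 'rum.cu'."""
    # Split from the right at most twice and keep the final two pieces.
    return ".".join(uri.rsplit(".", 2)[-2:])

def count_domains(conn):
    """Group the (assumed) uris table by base domain, most frequent first."""
    # Register the Python function so that SQL can call base_domain(uri).
    conn.create_function("base_domain", 1, base_domain)
    return conn.execute(
        "SELECT base_domain(uri) AS website, COUNT(*) AS occurrences "
        "FROM uris GROUP BY website ORDER BY occurrences DESC"
    ).fetchall()

# Sample data taken from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uris (uri TEXT)")
conn.executemany(
    "INSERT INTO uris (uri) VALUES (?)",
    [("daiquiri.rum.cu",), ("mojito.rum.cu",), ("cubalibre.rum.cu",),
     ("americano.campari.it",), ("negroni.campari.it",), ("hemingway.com",)],
)
rows = count_domains(conn)
print(rows)
```

This keeps the string handling outside SQL entirely, which is why it is not an option when only plain SQLite is available.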

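Here is the SQLite solution: the difference between a domain and a subdomain is that a domain contains exactly one dot, whereas a subdomain contains at least two. So whenever there is more than one dot, we must remove the first part, including the first dot, to get to the domain. Moreover, we want this to work even with subdomains such as abc.def.geh.ijk.com, so we must do this recursively.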

with recursive cte(uri) as 
(
  select uri from uris
  union all
  select substr(uri, instr(uri, '.') + 1) as uri from cte where instr(uri, '.') > 0
)
select uri, count(*)
from cte
where length(uri) = length(replace(uri,'.','')) + 1 -- domains only
group by uri
order by count(*) desc;

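Here we generate 'daiquiri.rum.cu', 'rum.cu' and 'cu' from 'daiquiri.rum.cu'. So for each URI we get the domain (here 'rum.cu') plus some other strings. At last we filter with LENGTH to keep only the strings that contain exactly one dot: the domains. The rest is grouping and counting.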

Here is the SQL fiddle: http://sqlfiddle.com/#!5/c1f35/37.
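To reproduce this locally, the query can be run against an in-memory database. This is a sketch assuming Python's sqlite3 module and a SQLite build with recursive CTE support (3.8.3 or later); the table uris(uri) is filled with the six sample rows from the question, and the first statement restricts the recursion to a single URI purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uris (uri TEXT)")
conn.executemany(
    "INSERT INTO uris (uri) VALUES (?)",
    [("daiquiri.rum.cu",), ("mojito.rum.cu",), ("cubalibre.rum.cu",),
     ("americano.campari.it",), ("negroni.campari.it",), ("hemingway.com",)],
)

# Recursive part alone, restricted to one URI, to show the generated suffixes.
expand_sql = """
with recursive cte(uri) as
(
  select uri from uris where uri = 'daiquiri.rum.cu'
  union all
  select substr(uri, instr(uri, '.') + 1) from cte where instr(uri, '.') > 0
)
select uri from cte
"""
suffixes = sorted((row[0] for row in conn.execute(expand_sql)),
                  key=len, reverse=True)

# The complete query from the answer, with column aliases added.
count_sql = """
with recursive cte(uri) as
(
  select uri from uris
  union all
  select substr(uri, instr(uri, '.') + 1) from cte where instr(uri, '.') > 0
)
select uri as website, count(*) as occurrences
from cte
where length(uri) = length(replace(uri, '.', '')) + 1 -- exactly one dot
group by uri
order by count(*) desc
"""
ranking = conn.execute(count_sql).fetchall()

print(suffixes)  # the three strings derived from 'daiquiri.rum.cu'
print(ranking)   # (website, occurrences) pairs, most frequent first
```

With the sample rows, the ranking should come out as rum.cu with 3, campari.it with 2 and hemingway.com with 1, matching the table in the question.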

Answer 2

select x.site, count(*)
from mytable a
inner join 
(
    select 'rum.cu' as site
    union all select 'campari.it'
    union all select 'hemingway.com'
) x on a.url like '%' + x.site + '%'
group by x.site -- EDIT I missed out the GROUP BY on the first go - sorry!

(This is how I would do it in SQL Server; I am not sure how SQLite differs in terms of syntax.)

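'mytable' is your table, which has a column called url containing 'mojito.rum.cu' and so on. I have not put a '.' after the leading '%' because that would miss hemingway.com. You can get around that, however, by using this line instead: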

) x on a.url like '%.' + x.site + '%' or a.url = x.site

You may not need the final + '%'. I put it there for URLs such as hemingway.com/some-page.html; if you have no such URLs, you can leave it out.
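Regarding the syntax difference: in SQLite, + is numeric addition only, and string concatenation is written with ||. A possible SQLite rendering of the fixed-list query is sketched below, assuming Python's sqlite3 module and a table mytable(url) holding the question's sample rows; the ORDER BY clause is an addition made here to get a stable ranking.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (url TEXT)")
conn.executemany(
    "INSERT INTO mytable (url) VALUES (?)",
    [("daiquiri.rum.cu",), ("mojito.rum.cu",), ("cubalibre.rum.cu",),
     ("americano.campari.it",), ("negroni.campari.it",), ("hemingway.com",)],
)

# In SQLite, + coerces its text operands to numbers instead of joining them.
plus_result = conn.execute("select '%' + 'rum.cu' + '%'").fetchone()[0]

# Same join as in the answer, with || for concatenation.
query = """
select x.site, count(*)
from mytable a
inner join
(
    select 'rum.cu' as site
    union all select 'campari.it'
    union all select 'hemingway.com'
) x on a.url like '%' || x.site || '%'
group by x.site
order by count(*) desc -- added here for a stable ranking
"""
rows = conn.execute(query).fetchall()

print(plus_result)
print(rows)
```

The stricter join condition translates the same way: a.url like '%.' || x.site || '%' or a.url = x.site.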

EDIT for dynamic names

select x.site, count(*)
from mytable a
inner join 
(
    select distinct substr(url, instr(url, '.') + 1) as site -- drop the first label and its dot
    from mytable
    where url like '%.%.%'
    union
    select distinct url
    from mytable
    where url like '%.%' and url not like '%.%.%'
) x on a.url like '%' + x.site + '%'
group by x.site

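Something like this should do it. I have not tested whether the INSTR() function is used correctly; you may need to add or subtract 1 from the offset it produces when you test it. It may not be the fastest query, but it should work.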
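As a way to check the offset, the dynamic version can be tried in SQLite directly. The sketch below is one possible rendering, not a tested port of the SQL Server query: it assumes Python's sqlite3 module and a table mytable(url) with the question's sample rows, uses substr to remove the first label, || for concatenation, and adds an ORDER BY for a stable ranking.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (url TEXT)")
conn.executemany(
    "INSERT INTO mytable (url) VALUES (?)",
    [("daiquiri.rum.cu",), ("mojito.rum.cu",), ("cubalibre.rum.cu",),
     ("americano.campari.it",), ("negroni.campari.it",), ("hemingway.com",)],
)

# instr() is 1-based, so "+ 1" starts the substring just after the first dot.
query = """
select x.site, count(*)
from mytable a
inner join
(
    -- URLs with at least two dots: strip the first label only
    select distinct substr(url, instr(url, '.') + 1) as site
    from mytable
    where url like '%.%.%'
    union
    -- URLs with exactly one dot are already bare domains
    select distinct url
    from mytable
    where url like '%.%' and url not like '%.%.%'
) x on a.url like '%' || x.site || '%'
group by x.site
order by count(*) desc -- added here for a stable ranking
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Because only one label is removed, a deeper name such as a.b.rum.cu would contribute b.rum.cu as a separate site rather than being folded into rum.cu.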

Finding similar entries in a SQL column and ranking them by frequency
