How can I query text in SQL to find the longest prefix strings?

Time: 2017-02-06 23:18:17

Tags: sql apache-spark-sql ansi-sql

I am using Spark SQL. Let's say this is a snapshot of my big table:

ups store
ups store austin
ups store chicago
ups store bern
walmart
target

How can I find the longest prefixes for the above data in SQL? That is:

 ups store
 walmart
 target

I already have a Java program that does this, but I have a large file, so my question is whether this can reasonably be done in SQL.

What about the following, more complicated scenario? (I can live without it, but it would be nice to have if possible.)

ups store austin
ups store chicago
ups store bern
walmart
target

That would return [ups store, walmart, target].

2 Answers:

Answer 0: (score: 1)

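Assuming you are free to create another table that simply contains an ascending list of integers from zero up to the length of the longest possible string, the following should do the job using only ANSI SQL: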

SELECT
  id,
  SUBSTRING(name, 1, CASE WHEN number = 0 THEN LENGTH(name) ELSE number END) AS prefix
FROM
 -- Join all places to all possible substring lengths.
 (SELECT *
  FROM places p
  CROSS JOIN lengths l) subq
-- If number is zero then no prefix match was found elsewhere
-- (from the question it looked like you wanted to include these)
WHERE (subq.number = 0 OR
       -- Look for prefix match elsewhere
       EXISTS (SELECT * FROM places p
               WHERE SUBSTRING(p.name FROM 1 FOR subq.number)
                     = SUBSTRING(subq.name FROM 1 FOR subq.number)
                 AND p.id <> subq.id))
  -- Include as a prefix match if the whole string is being used
  AND (subq.number = LENGTH(name)
       -- Don't include trailing spaces in a prefix
       OR (SUBSTRING(subq.name, subq.number, 1) <> ' '
           -- Only include the longest prefix match 
           AND NOT EXISTS (SELECT * FROM places p 
                           WHERE SUBSTRING(p.name FROM 1 FOR subq.number + 1)
                                 = SUBSTRING(subq.name FROM 1 FOR subq.number + 1)
                             AND p.id <> subq.id)))
ORDER BY id;

Live demo: http://rextester.com/XPNRP24390
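For anyone who wants to try the idea locally, here is a minimal sketch that runs a port of the query above against the sample rows from the question. It uses Python's built-in sqlite3 module rather than Spark SQL, so these parts are assumptions and not part of the answer: the `places (id, name)` and `lengths (number)` table definitions, the recursive CTE that fills `lengths`, and the use of SQLite's `substr`/`length` in place of `SUBSTRING`/`LENGTH`. Other engines may need a different way to generate the numbers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE places (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE lengths (number INTEGER PRIMARY KEY);
""")

# Sample rows from the question; ids 1..6 are assigned in this order.
names = ["ups store", "ups store austin", "ups store chicago",
         "ups store bern", "walmart", "target"]
conn.executemany("INSERT INTO places (id, name) VALUES (?, ?)",
                 list(enumerate(names, start=1)))

# Assumption: fill the helper table with 0..max(length(name)) via a
# recursive CTE (SQLite syntax, chosen here only for convenience).
conn.execute("""
    WITH RECURSIVE seq(n) AS (
        SELECT 0
        UNION ALL
        SELECT n + 1 FROM seq
        WHERE n < (SELECT MAX(length(name)) FROM places)
    )
    INSERT INTO lengths (number) SELECT n FROM seq
""")

# Port of the answer's query: same conditions, SQLite function names.
query = """
SELECT subq.id,
       substr(subq.name, 1,
              CASE WHEN subq.number = 0 THEN length(subq.name)
                   ELSE subq.number END) AS prefix
FROM (SELECT p.id, p.name, l.number
      FROM places p CROSS JOIN lengths l) AS subq
WHERE (subq.number = 0
       OR EXISTS (SELECT 1 FROM places p
                  WHERE substr(p.name, 1, subq.number)
                        = substr(subq.name, 1, subq.number)
                    AND p.id <> subq.id))
  AND (subq.number = length(subq.name)
       OR (substr(subq.name, subq.number, 1) <> ' '
           AND NOT EXISTS (SELECT 1 FROM places p
                           WHERE substr(p.name, 1, subq.number + 1)
                                 = substr(subq.name, 1, subq.number + 1)
                             AND p.id <> subq.id)))
ORDER BY subq.id
"""
rows = conn.execute(query).fetchall()
print(rows)
```

On the six sample rows this should print the three prefixes asked for in the first part of the question (ups store, walmart, target), each paired with the id of the row it came from. Note that the cross join produces one candidate per (row, length) pair, so on a genuinely large table the cost of this approach is worth measuring before relying on it.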

  

第二个方面是,如果我们有(ups存储奥斯汀,ups商店   芝加哥)。我们可以使用SQL从中提取'ups store'。

This should just be a case of using SUBSTRING in a similar way to the above, for example:

SELECT SUBSTRING(name,
                 LENGTH('ups store ') + 1,
                 LENGTH(name) - LENGTH('ups store '))
FROM places
WHERE SUBSTRING(name,
                1,
                LENGTH('ups store ')) = 'ups store ';
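As a quick check of what that query returns, here is a small sketch using Python's sqlite3 module with the sample rows from the question. The table definition, the `:prefix` parameter, and the added `ORDER BY id` are assumptions for illustration; SQLite's `substr` stands in for `SUBSTRING`. For rows that start with the given prefix, the query yields the part of the name that follows it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
names = ["ups store", "ups store austin", "ups store chicago",
         "ups store bern", "walmart", "target"]
conn.executemany("INSERT INTO places (name) VALUES (?)", [(n,) for n in names])

# The known prefix, including its trailing space, as in the query above.
prefix = "ups store "

cursor = conn.execute("""
    SELECT substr(name, length(:prefix) + 1, length(name) - length(:prefix))
    FROM places
    WHERE substr(name, 1, length(:prefix)) = :prefix
    ORDER BY id
""", {"prefix": prefix})
suffixes = [row[0] for row in cursor]
print(suffixes)  # ['austin', 'chicago', 'bern']
```

The bare 'ups store' row is not matched because it has no trailing space, so only the three longer names contribute a result.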

Answer 1: (score: 0)

Assuming your column name is "mycolumn", your big table is "mytable", and a single space is your field delimiter:

In PostgreSQL, you could do something as simple as:

select
   mycolumn
from
   mytable
order by
   length(split_part(mycolumn, ' ', 1)) desc
limit
   1

If you run this query often, I would probably try an ordered functional index on the table, like this:

create index prefix_index on mytable (length(split_part(mycolumn, ' ', 1)) desc)