Question

I have event data that looks like this:

 id | instance_id | value
 1  | 1           | a
 2  | 1           | ap
 3  | 1           | app
 4  | 1           | appl
 5  | 2           | b
 6  | 2           | bo
 7  | 1           | apple
 8  | 2           | boa
 9  | 2           | boat
10  | 2           | boa
11  | 1           | appl
12  | 1           | apply

Basically, each row represents a user typing a new letter. Users can also delete letters.

I want to create a dataset that looks like this (let's call it data):

 id | instance_id | value
 7  | 1           | apple
 9  | 2           | boat
12  | 1           | apply

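My goal is to extract all the complete words in each instance while accounting for deletions, so simply taking the longest word or the most recently typed word is not enough.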

To do this, I was planning to use a regex operation like this:

select * from data
where not exists (select * from data d2 where d2.value ~ (d.value || '.'))

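In effect, I'm trying to build a dynamic regex that matches any value one character longer than the current value, with the pattern specific to the row being matched.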

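The code above doesn't seem to work. In Python, I can "compile" a regex pattern before using it. What is the equivalent in PostgreSQL for building a pattern dynamically?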

Answer 1

Try using the simple LIKE operator instead of a regex pattern:

SELECT * FROM data d1
WHERE NOT EXISTS (
  SELECT * FROM data d2
  WHERE d2.value LIKE d1.value ||'_%'
)

Demo: https://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=cd064c92565639576ff456dbe0cd5f39

Create an index on the value column to speed up the query.
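To make the logic of that NOT EXISTS check concrete, here is a minimal Python sketch of the same idea, run against the sample rows from the question. The names ROWS, has_longer_extension, and complete_words are made up for illustration and are not part of the answer. One assumed difference: str.startswith compares literally, whereas LIKE would treat any % or _ inside value as a wildcard.

```python
# Sample rows from the question: (id, instance_id, value).
ROWS = [
    (1, 1, "a"), (2, 1, "ap"), (3, 1, "app"), (4, 1, "appl"),
    (5, 2, "b"), (6, 2, "bo"), (7, 1, "apple"), (8, 2, "boa"),
    (9, 2, "boat"), (10, 2, "boa"), (11, 1, "appl"), (12, 1, "apply"),
]


def has_longer_extension(value, rows):
    """True if some row's value starts with `value` and is at least one character longer.

    Roughly mirrors `d2.value LIKE d1.value || '_%'`, but with a literal prefix match.
    """
    return any(
        other.startswith(value) and len(other) > len(value)
        for _, _, other in rows
    )


def complete_words(rows):
    """Keep only rows whose value is not a proper prefix of any other value."""
    return [row for row in rows if not has_longer_extension(row[2], rows)]


if __name__ == "__main__":
    for row in complete_words(ROWS):
        print(row)
```

Like the query, this sketch compares each value against all rows rather than only rows with the same instance_id; for the sample data it keeps ids 7, 9, and 12.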

Answer 2

Window functions are a good choice for finding peaks in sequential data. You just need to compare each value with the previous and next ones using the lag() and lead() functions:

with cte as (
  select 
    *, 
    length(value) > coalesce(length(lead(value) over (partition by instance_id order by id)),0) and
    length(value) > coalesce(length(lag(value) over (partition by instance_id order by id)),length(value)) as is_peak
  from data)
select * from cte where is_peak order by id;
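For readers who want to trace the lag()/lead() comparison step by step, the following is a rough Python sketch of the same peak test, assuming the sample rows from the question. ROWS and find_peaks are hypothetical names used only for this illustration.

```python
from collections import defaultdict

# Sample rows from the question: (id, instance_id, value).
ROWS = [
    (1, 1, "a"), (2, 1, "ap"), (3, 1, "app"), (4, 1, "appl"),
    (5, 2, "b"), (6, 2, "bo"), (7, 1, "apple"), (8, 2, "boa"),
    (9, 2, "boat"), (10, 2, "boa"), (11, 1, "appl"), (12, 1, "apply"),
]


def find_peaks(rows):
    """Return rows whose value is longer than both neighbours within the same instance."""
    # Equivalent of `partition by instance_id order by id`.
    by_instance = defaultdict(list)
    for row in sorted(rows):  # tuples sort by id first
        by_instance[row[1]].append(row)

    peaks = []
    for seq in by_instance.values():
        for i, row in enumerate(seq):
            cur = len(row[2])
            # lead(): length of the next value, 0 when there is no next row.
            nxt = len(seq[i + 1][2]) if i + 1 < len(seq) else 0
            # lag(): length of the previous value, falling back to the current length.
            prev = len(seq[i - 1][2]) if i > 0 else cur
            if cur > nxt and cur > prev:
                peaks.append(row)
    return sorted(peaks)  # equivalent of the final `order by id`


if __name__ == "__main__":
    for row in find_peaks(ROWS):
        print(row)
```

Because the missing previous value falls back to the current length, the first row of an instance can never satisfy the strict comparison, so it is never reported as a peak. On the sample data the sketch returns ids 7, 9, and 12.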


PostgreSQL: dynamic regular expression pattern
