I have a PostgreSQL table, let's call it tokens, that holds grammatical annotations for each token in a line of text. It looks basically like this:
 idx | line | tno | token   | annotation      | lemma
-----+------+-----+---------+-----------------+---------
   1 | I.01 |   1 | This    | DEM.PROX        | this
   2 | I.01 |   2 | is      | VB.COP.3SG.PRES | be
   3 | I.01 |   3 | an      | ART.INDEF       | a
   4 | I.01 |   4 | example | NN.INAN         | example
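I want to write a query that searches the grammatical context: it should check whether a row with a specific annotation exists within a window of size n before and after the current row. As far as I understand, PostgreSQL's window functions LEAD and LAG are suited to this. As a first attempt, I wrote the following query based on the documentation I could find on these functions: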
SELECT *
FROM (
    SELECT token, annotation, lemma,
        -- LAG(annotation) OVER prev_rows AS prev_anno, -- ?????
        LEAD(annotation) OVER next_rows AS next_anno
    FROM tokens
    WINDOW next_rows AS (
        ORDER BY line, tno ASC
        ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
    )
    ORDER BY line, tno ASC
) AS "window"
WHERE
    lemma LIKE '...'
    AND "window".next_anno LIKE '...'
;
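However, this only searches the following 2 rows. My question is: how can I rewrite the query so that the window covers both the preceding and the following rows of the table? Obviously, I can't have 2 WINDOW statements or do something like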
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
AND ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
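Answer 0 (score: 3)
I'm not sure I have understood your use case correctly: you want to check whether a given annotation occurs in any of 5 rows (2 preceding, the current one, 2 following). Is that right?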
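It is possible to define a window that spans both directions (marked -- 1 in the query below):
ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
LEAD and LAG, however, each return only a single value: here, the one from the row directly after or before the current row (or NULL if there is no such row), no matter how many rows the window frame contains. But you want to check any of these five rows. One way to achieve this: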
SELECT *
FROM (
    SELECT token, annotation, lemma,
        unnest(array_agg(annotation) OVER w) AS surrounded_annos -- 2
    FROM tokens
    WINDOW w AS ( -- 1
        ORDER BY line, tno ASC
        ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
    )
    ORDER BY line, tno ASC
) AS "window"
WHERE
    lemma LIKE '...'
    AND "window".surrounded_annos LIKE '...'
;
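array_agg (marked -- 2 in the query) collects the annotations of all rows in the frame (five where available, fewer at the edges of the table) into an array. unnest then expands that array into one row per element because, as far as I know, LIKE cannot be applied directly to the elements of an array. This gives you the following result, which can be filtered in the next step. Result of the subquery: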
 token   | annotation      | lemma   | surrounded_annos
---------+-----------------+---------+------------------
 This    | DEM.PROX        | this    | DEM.PROX
 This    | DEM.PROX        | this    | VB.COP.3SG.PRES
 This    | DEM.PROX        | this    | ART.INDEF
 is      | VB.COP.3SG.PRES | be      | DEM.PROX
 is      | VB.COP.3SG.PRES | be      | VB.COP.3SG.PRES
 is      | VB.COP.3SG.PRES | be      | ART.INDEF
 is      | VB.COP.3SG.PRES | be      | NN.INAN
 an      | ART.INDEF       | a       | DEM.PROX
 an      | ART.INDEF       | a       | VB.COP.3SG.PRES
 an      | ART.INDEF       | a       | ART.INDEF
 an      | ART.INDEF       | a       | NN.INAN
 example | NN.INAN         | example | VB.COP.3SG.PRES
 example | NN.INAN         | example | ART.INDEF
 example | NN.INAN         | example | NN.INAN
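If all you need to know is whether some row in the frame matches, the row multiplication caused by unnest can be avoided. The following is a sketch of an alternative, not part of the query above: it uses bool_or, an ordinary PostgreSQL aggregate, as a window function over the same frame. The inline tokens CTE only reproduces the sample rows from the question so that the statement runs on its own, and the patterns 'be' and '%.PROX' are illustrative stand-ins for the elided '...' patterns.

```sql
WITH tokens (idx, line, tno, token, annotation, lemma) AS (
    -- sample rows copied from the question; drop this CTE to use the real table
    VALUES (1, 'I.01', 1, 'This',    'DEM.PROX',        'this'),
           (2, 'I.01', 2, 'is',      'VB.COP.3SG.PRES', 'be'),
           (3, 'I.01', 3, 'an',      'ART.INDEF',       'a'),
           (4, 'I.01', 4, 'example', 'NN.INAN',         'example')
)
SELECT token, annotation, lemma
FROM (
    SELECT token, annotation, lemma,
           -- true if any row inside the frame has a matching annotation
           bool_or(annotation LIKE '%.PROX') OVER w AS has_match
    FROM tokens
    WINDOW w AS (
        ORDER BY line, tno
        ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
    )
) AS ctx
WHERE lemma LIKE 'be'   -- illustrative pattern
  AND has_match;
```

On the sample data this should return the single row for "is", because DEM.PROX lies one row before it. Note that the frame includes the current row itself; if the current token's own annotation should not count, PostgreSQL 11 and later accept EXCLUDE CURRENT ROW at the end of the frame clause.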
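Answer 1 (score: 0)
Another approach is to compute the relative position of each token within its sentence and do a token<->token self-join (this also lets you select tokens by distance, as with skip-grams):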
WITH www AS ( -- enumerate word positions within sentences
    SELECT line, tno -- candidate key
         , row_number() OVER sentence AS rn
    FROM tokens
    WINDOW sentence AS (ORDER BY line ASC, tno ASC)
)
SELECT t0.line AS line
     , t0.token AS this
     , t1.tno AS tno
     , w1.rn - w0.rn AS rel -- relative position
     , t1.token AS that
     , t1.annotation AS anno
FROM tokens t0
JOIN tokens t1 ON t1.line = t0.line -- same sentence
JOIN www w0 ON t0.line = w0.line AND t0.tno = w0.tno -- PK1
JOIN www w1 ON t1.line = w1.line AND t1.tno = w1.tno -- PK2
WHERE 1=1
AND t0.lemma LIKE 'be'
-- AND t1.annotation LIKE '%.PROX' AND w1.rn - w0.rn = -1
;
-- But if your tno is consecutive (gapless) within lines,
-- you can omit the enumeration step and do a plain self-join:
SELECT t0.line AS line
     , t0.token AS this
     , t1.tno AS tno
     , t1.tno - t0.tno AS rel -- relative position
     , t1.token AS that
     , t1.annotation AS anno
FROM tokens t0
JOIN tokens t1 ON t1.line = t0.line -- same sentence
WHERE 1=1
AND t0.lemma LIKE 'be'
-- AND t1.annotation LIKE '%.PROX' AND t1.tno - t0.tno = -1
;
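To mirror the "window of size n" from the question, the plain self-join can be bounded by distance. This is a minimal sketch, assuming tno is gapless within each line as the comment above requires. The inline tokens CTE only reproduces the question's sample rows so the statement runs on its own; the pattern '%.PROX' and the bound of 2 are illustrative choices, not values from the question.

```sql
WITH tokens (idx, line, tno, token, annotation, lemma) AS (
    -- sample rows copied from the question; drop this CTE to use the real table
    VALUES (1, 'I.01', 1, 'This',    'DEM.PROX',        'this'),
           (2, 'I.01', 2, 'is',      'VB.COP.3SG.PRES', 'be'),
           (3, 'I.01', 3, 'an',      'ART.INDEF',       'a'),
           (4, 'I.01', 4, 'example', 'NN.INAN',         'example')
)
SELECT t0.line
     , t0.token        AS this
     , t1.tno - t0.tno AS rel    -- relative position; negative means "before"
     , t1.token        AS that
     , t1.annotation   AS anno
FROM tokens t0
JOIN tokens t1 ON t1.line = t0.line                  -- same sentence
              AND t1.tno - t0.tno BETWEEN -2 AND 2   -- at most 2 tokens away
              AND t1.tno <> t0.tno                   -- skip the token itself
WHERE t0.lemma = 'be'
  AND t1.annotation LIKE '%.PROX';
```

On the sample data this should yield one row pairing "is" with "This" at rel = -1. Unlike the window-function variant, the join condition on line keeps the context from crossing sentence boundaries.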