Question

I have a tagged text corpus stored in an SQL table like the following:

id  tag1 tag2  token  sentence_id                          
0     a    e   five        1
1     b    f  score        1
2     c    g  years        1
3     d    h    ago        1

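My task is to search the table for sequences of tokens that meet certain criteria, sometimes with gaps between the tokens.

For example:

I want to be able to search for a sequence like the following:

the first token has the value a in the tag1 column,
the second token is one or two rows away from the first token and has the value g in tag2 or b in tag1, and
the third token is at least three rows away from the first and has ago in the token column.

In SQL, this would look something like the following: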

SELECT * FROM my_table t1 
JOIN my_table t2 ON t1.sentence_id = t2.sentence_id 
JOIN my_table t3 ON t3.sentence_id = t1.sentence_id 
WHERE t1.tag1 = 'a' AND (t2.id = t1.id + 1 OR t2.id = t1.id + 2) 
AND (t2.tag2 = 'g' OR t2.tag1 = 'b') 
AND t3.id >= t1.id + 3 AND t3.token = 'ago'

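So far I have only been able to do this by joining the table to itself each time I specify a new token in the sequence (e.g. JOIN my_table t4), but with millions of rows this gets slow. Is there a more efficient way to do this?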

Answer 1

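You need to edit your question and give more detail on how these token sequences work (for example, what does "each time I specify a new token in the sequence" mean in practice?).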

In PostgreSQL you can solve this kind of query with a window function, lead(), following your exact specification above. The idea is to select from my_table over a window partitioned by sentence_id and to use a CASE expression that produces a next_token column.

The lead() function looks ahead a number of rows (1 by default, when not specified) from the current row in the window frame, in this case all rows that share the same sentence_id, as given by the partition in the window definition. So lead(tag1, 2) looks at the value of tag1 two rows ahead to compare it against your condition, and lead(token, 2) returns the token from two rows ahead, within the same sentence_id, as the column next_token in the current row. If the first CASE condition fails, the second CASE condition is evaluated; if that fails too, NULL is returned. Note that the order of the conditions in the CASE clause matters: a different order produces different results.
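As a rough illustration of the lead() idea, here is a minimal sketch, not the answer's original query. It assumes ids are contiguous within a sentence (so "n rows ahead" equals "id + n"), uses SQLite through Python's sqlite3 module purely as a runnable stand-in for PostgreSQL, and adds a hypothetical second sentence to the sample data. The names w, ahead, ago_ahead, ROWS, QUERY and find_matches are made up for this example. The window functions are computed in a subquery because a WHERE clause in the same SELECT would filter rows before lead() sees them.

```python
import sqlite3

# Sentence 1 is the sample from the question; sentence 2 is a
# hypothetical non-matching sentence added for illustration.
ROWS = [
    (0, "a", "e", "five", 1),
    (1, "b", "f", "score", 1),
    (2, "c", "g", "years", 1),
    (3, "d", "h", "ago", 1),
    (4, "a", "e", "one", 2),
    (5, "c", "h", "day", 2),
    (6, "d", "h", "soon", 2),
]

QUERY = """
SELECT id, sentence_id, token, next_token
FROM (
    SELECT id, sentence_id, tag1, token,
           -- second token: one or two rows ahead, tag1 = 'b' or tag2 = 'g';
           -- the nearer row wins because its WHEN branch comes first
           CASE
               WHEN lead(tag1, 1) OVER w = 'b' OR lead(tag2, 1) OVER w = 'g'
                   THEN lead(token, 1) OVER w
               WHEN lead(tag1, 2) OVER w = 'b' OR lead(tag2, 2) OVER w = 'g'
                   THEN lead(token, 2) OVER w
           END AS next_token,
           -- third token: count of 'ago' tokens at least three rows ahead
           sum(CASE WHEN token = 'ago' THEN 1 ELSE 0 END) OVER ahead AS ago_ahead
    FROM my_table
    WINDOW w AS (PARTITION BY sentence_id ORDER BY id),
           ahead AS (PARTITION BY sentence_id ORDER BY id
                     ROWS BETWEEN 3 FOLLOWING AND UNBOUNDED FOLLOWING)
) AS s
WHERE tag1 = 'a' AND next_token IS NOT NULL AND ago_ahead > 0
ORDER BY id
"""


def find_matches(rows):
    """Load rows into an in-memory table and run the window-function query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE my_table (id INTEGER PRIMARY KEY, tag1 TEXT, "
        "tag2 TEXT, token TEXT, sentence_id INTEGER)"
    )
    conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in find_matches(ROWS):
        print(row)
```

This reads the table once per window instead of self-joining it for every token in the sequence; whether that is actually faster on millions of rows would need to be measured on the real data.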

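Obviously, if you keep adding conditions for further tokens, the query becomes very complex, and you may have to put the individual search conditions into separate stored procedures and then call them as your requirements dictate.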

Answer 2

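You can try this staged approach:

1. Apply each condition (other than the various distance conditions) as a subquery
2. Calculate the distances between the tokens that meet the conditions
3. Apply all the distance conditions separately.

Indexes on the tag1, tag2 and token columns might improve things somewhat: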

SELECT DISTINCT sentence_id FROM
(
  -- 2. Here we calculate the distances
  SELECT cond1.sentence_id,
  (cond2.id - cond1.id) as cond2_distance,
  (cond3.id - cond1.id) as cond3_distance
  FROM
  -- 1. These are all the non-distance conditions
  (
    SELECT * FROM my_table WHERE tag1 = 'a'
  ) cond1
  INNER JOIN
  (
    SELECT * FROM my_table WHERE 
    (tag1 = 'b' OR tag2 = 'g')
  ) cond2
  ON cond1.sentence_id = cond2.sentence_id
  INNER JOIN
  (
    SELECT * FROM my_table WHERE token = 'ago'
  ) cond3
  ON cond1.sentence_id = cond3.sentence_id
) conditions
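-- 3. Now apply the distance conditions
-- (the question asks for a gap of 1 or 2 rows; a lower bound of 0
-- also accepts a single row that satisfies both cond1 and cond2)
WHERE cond2_distance BETWEEN 0 AND 2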
AND cond3_distance >= 3
ORDER BY sentence_id;

If you apply this query to this SQL fiddle, you get:

| sentence_id |
|-------------|
|           1 |
|           4 |

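Which is what you want. Whether it is any faster, only you (with your million-row database) can really tell, but from the point of view of actually having to write these queries, you will probably find them easier to read, understand and maintain.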

SQL: Most efficient way to select sequences of rows from a table
