Searching subtitle data in Elasticsearch

Time: 2015-02-10 12:21:36

Tags: elasticsearch

I have the following data (a simple srt):

1
00:02:17,440 --> 00:02:20,375
Senator, we're making our final

2
00:02:20,476 --> 00:02:22,501
approach into Coruscant.

...

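What would be the best way to index this in Elasticsearch? Here's the catch: I want the search result highlights to link to the exact time the timestamp indicates. Also, some phrases overlap multiple srt lines (such as "final approach" in the example above).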

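My ideas are:

  • Index the srt file as a list type, with the timestamps as the index. I believe this won't match phrases that overlap multiple keys.
  • Create a custom tokenizer that indexes only the text part. I'm not sure how Elasticsearch would then highlight the original content.
  • Index only the text part and map it back to the timestamps outside of Elasticsearch.

Or is there another option that would solve this in an elegant way?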

1 Answer:

Answer 0: (score: 3)

Interesting question. Here's my take on it.

In essence, subtitles don't "know" about each other, so it's best to include the previous and the following subtitle text in each document (n - 1, n, n + 1) whenever applicable.

As such, you'd be after a document structure similar to:

{
  "sub_id" : 0,
  "start" : "00:02:17,440",
  "end" : "00:02:20,375",
  "text" : "Senator, we're making our final",
  "overlapping_text" : "Senator, we're making our final approach into Coruscant."
}

To arrive at such a document structure I used the following (inspired by this excellent answer):

from itertools import groupby
from collections import namedtuple


def parse_subs(fpath):
    """Parse an srt file into a list of dicts ready for indexing."""
    # "chunk" our input file, delimited by blank lines
    with open(fpath) as f:
        res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]

    Subtitle = namedtuple('Subtitle', 'sub_id start end text')

    subs = []

    # turn each chunk (id line, timing line, text lines) into a Subtitle
    for sub in res:
        if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
            sub = [x.strip() for x in sub]
            sub_id, start_end, *content = sub  # py3 syntax
            start, end = start_end.split(' --> ')

            # store the id as an int
            sub_id = int(sub_id)

            # join multi-line text
            text = ', '.join(content)

            subs.append(Subtitle(
                sub_id,
                start,
                end,
                text
            ))

    es_ready_subs = []

    # attach the neighbouring subtitle text (n - 1 and n + 1) to each entry
    for index, sub_object in enumerate(subs):
        prev_sub_text = ''
        next_sub_text = ''

        if index > 0:
            prev_sub_text = subs[index - 1].text + ' '

        if index < len(subs) - 1:
            next_sub_text = ' ' + subs[index + 1].text

        es_ready_subs.append(dict(
            **sub_object._asdict(),
            overlapping_text=prev_sub_text + sub_object.text + next_sub_text
        ))

    return es_ready_subs

Once the subtitles are parsed, they can be ingested into ES. Before that's done, set up the following mapping so that your timestamps are properly searchable and sortable:

PUT my_subtitles_index
{
  "mappings": {
    "properties": {
      "start": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      },
      "end": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      }
    }
  }
}
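
The as_timestamp sub-fields in this mapping expect values shaped like HH:mm:ss,SSS. Once a hit comes back, the same string usually has to be turned into a playback offset in order to link to the exact moment. A minimal client-side sketch, assuming plain Python and a hypothetical helper name srt_to_millis (not part of the original answer), might look like this:

```python
import re

# Matches SRT timestamps such as "00:02:17,440".
_SRT_TS = re.compile(r"^(\d{2}):(\d{2}):(\d{2}),(\d{3})$")


def srt_to_millis(timestamp):
    """Convert an SRT timestamp string to milliseconds from the start.

    Hypothetical helper for illustration; raises ValueError on bad input.
    """
    match = _SRT_TS.match(timestamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {timestamp!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


print(srt_to_millis("00:02:17,440"))  # 137440
```

The resulting number can be passed to whatever player or URL scheme is in use; that part is outside the scope of this sketch.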

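Once that's done, proceed to ingest:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

from utils.parse import parse_subs

# connection details depend on your cluster and client version
es = Elasticsearch()

es_ready_subs = parse_subs('subs.txt')

actions = [
    {
        "_index": "my_subtitles_index",
        "_id": sub_group['sub_id'],
        "_source": sub_group
    } for sub_group in es_ready_subs
]

bulk(es, actions)

Once ingested, you can target the original subtitle text and boost it when it directly matches your phrase. Otherwise, add a fallback on the overlapping_text field, which ensures that both "overlapping" subtitles are returned.

Before returning, you can make sure the hits are ordered by start, ascending. This somewhat defeats the purpose of boosting, but if you do sort, you can specify track_scores:true in the URI so that the originally calculated scores are returned too.

Putting it all together:

POST my_subtitles_index/_search?track_scores&filter_path=hits.hits
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "text": {
              "query": "final approach",
              "boost": 2
            }
          }
        },
        {
          "match_phrase": {
            "overlapping_text": {
              "query": "final approach"
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "start.as_timestamp": {
        "order": "asc"
      }
    }
  ]
}

For the sample above, this should return both overlapping subtitles, ordered by start.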
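
Because both neighbouring subtitles match an overlapping phrase, the client still has to decide which timestamp to link to. One possible post-processing step keeps only the subtitle in which the phrase begins. This is a rough sketch, assuming the hits are plain dicts shaped like the hits.hits entries of the search response, and that a case-insensitive substring check is good enough (it only approximates Elasticsearch's analyzed phrase matching). The function name pick_phrase_start is hypothetical:

```python
def pick_phrase_start(hits, phrase):
    """Return (start, text) pairs for hits whose own text holds the phrase start."""
    needle = phrase.lower()
    picked = []
    for hit in hits:
        source = hit["_source"]
        own_text = source["text"].lower()
        window = source["overlapping_text"].lower()

        # Locate this subtitle's own text and the phrase inside the window.
        # Only the first occurrence of each is considered, so repeated text
        # in neighbouring subtitles can make this ambiguous.
        own_offset = window.find(own_text)
        phrase_offset = window.find(needle)
        if own_offset == -1 or phrase_offset == -1:
            continue  # the plain substring check did not find a match

        # Keep the hit only if the phrase begins inside this subtitle's text.
        if own_offset <= phrase_offset < own_offset + len(own_text):
            picked.append((source["start"], source["text"]))
    return picked


# Sample hits modelled on the two subtitles from the question.
sample_hits = [
    {"_source": {
        "start": "00:02:17,440",
        "text": "Senator, we're making our final",
        "overlapping_text": "Senator, we're making our final approach into Coruscant.",
    }},
    {"_source": {
        "start": "00:02:20,476",
        "text": "approach into Coruscant.",
        "overlapping_text": "Senator, we're making our final approach into Coruscant.",
    }},
]

print(pick_phrase_start(sample_hits, "final approach"))
```

With the sample hits, only the first subtitle is kept, since "final" sits in its own text; the start value of that hit is the one to link to.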