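Given the following data (a simple srt):

1
00:02:17,440 --> 00:02:20,375
Senator, we're making our final

2
00:02:20,476 --> 00:02:22,501
approach into Coruscant.

...

what would be the best way to index it in Elasticsearch? Now here's the catch: I want each highlighted search result to link to the exact time its timestamp indicates. Also, some phrases span multiple srt rows (such as "final approach" in the example above).

I have some ideas of my own, but is there another option that would solve this in an elegant way?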
Answer 0 (score: 3)

Interesting problem. Here's my take on it.
In essence, the subtitles "don't know" about each other, so it is best to include the previous and subsequent subtitle text in each doc (n - 1, n, n + 1) whenever applicable.

As such, you'd be looking for a doc structure akin to:

{
  "sub_id" : 0,
  "start" : "00:02:17,440",
  "end" : "00:02:20,375",
  "text" : "Senator, we're making our final",
  "overlapping_text" : "Senator, we're making our final approach into Coruscant."
}

To arrive at such a doc structure I used the following (inspired by this excellent answer):

from itertools import groupby
from collections import namedtuple


def parse_subs(fpath):
    # "chunk" our input file, delimited by blank lines
    with open(fpath) as f:
        res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]

    Subtitle = namedtuple('Subtitle', 'sub_id start end text')

    subs = []

    # grouping
    for sub in res:
        if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
            sub = [x.strip() for x in sub]
            sub_id, start_end, *content = sub  # py3 syntax
            start, end = start_end.split(' --> ')

            # ints only
            sub_id = int(sub_id)

            # join multi-line text
            text = ', '.join(content)

            subs.append(Subtitle(
                sub_id,
                start,
                end,
                text
            ))

    es_ready_subs = []

    # attach the neighbouring subtitles' text (n - 1 and n + 1) to each doc
    for index, sub_object in enumerate(subs):
        prev_sub_text = ''
        next_sub_text = ''

        if index > 0:
            prev_sub_text = subs[index - 1].text + ' '

        if index < len(subs) - 1:
            next_sub_text = ' ' + subs[index + 1].text

        es_ready_subs.append(dict(
            **sub_object._asdict(),
            overlapping_text=prev_sub_text + sub_object.text + next_sub_text
        ))

    return es_ready_subs

Once the subtitles are parsed, they can be ingested into ES. Before that's done, set up the following mapping so that your timestamps are properly searchable and sortable:

PUT my_subtitles_index
{
  "mappings": {
    "properties": {
      "start": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      },
      "end": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      }
    }
  }
}

Once that's done, proceed to ingest:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

from utils.parse import parse_subs

es = Elasticsearch()

es_ready_subs = parse_subs('subs.txt')

actions = [
    {
        "_index": "my_subtitles_index",
        "_id": sub_group['sub_id'],
        "_source": sub_group
    } for sub_group in es_ready_subs
]

bulk(es, actions)

Once ingested, you can target the original subtitle text field and boost it if it directly matches your phrase. Otherwise, add a fallback on the overlapping_text field, which ensures that both "overlapping" subtitles are returned.

Before returning, you can make sure that the hits are ordered by start, ascending. That kind of defeats the purpose of boosting, but if you do sort, you can specify track_scores:true in the URI to make sure the originally calculated scores are returned too.

Putting it all together:

POST my_subtitles_index/_search?track_scores&filter_path=hits.hits
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "text": {
              "query": "final approach",
              "boost": 2
            }
          }
        },
        {
          "match_phrase": {
            "overlapping_text": {
              "query": "final approach"
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "start.as_timestamp": {
        "order": "asc"
      }
    }
  ]
}
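
Since the question asks for results that link to the exact time, here is a minimal sketch of how the query above could be built for an arbitrary phrase and how the returned hits could be turned into time-offset links. The helper names (build_phrase_query, srt_timestamp_to_seconds, hits_to_links), the base URL and the "#t=<seconds>" fragment are assumptions made for illustration; they are not part of the answer above, and your player may expect a different link format.

```python
def build_phrase_query(phrase, boost=2):
    """Build the same bool/should body as the query above for any phrase."""
    return {
        "query": {
            "bool": {
                "should": [
                    # direct match on the original subtitle text, boosted
                    {"match_phrase": {"text": {"query": phrase, "boost": boost}}},
                    # fallback: phrase spans two neighbouring subtitles
                    {"match_phrase": {"overlapping_text": {"query": phrase}}},
                ]
            }
        },
        "sort": [{"start.as_timestamp": {"order": "asc"}}],
    }


def srt_timestamp_to_seconds(timestamp):
    """Convert an SRT timestamp such as '00:02:17,440' to seconds (float)."""
    clock, millis = timestamp.split(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(millis) / 1000


def hits_to_links(hits, base_url):
    """Map hits (the hits.hits list of a search response) to (text, link) pairs.

    The '#t=<seconds>' fragment is an assumed player convention.
    """
    links = []
    for hit in hits:
        source = hit["_source"]
        # whole seconds are enough for a jump-to-time link
        offset = int(srt_timestamp_to_seconds(source["start"]))
        links.append((source["text"], f"{base_url}#t={offset}"))
    return links
```

The body returned by build_phrase_query can be sent to the _search endpoint shown above, and the hits.hits list of the response passed to hits_to_links. Only the start field is used for the link, so a phrase that spans two subtitles links to the earlier of the two hits once they are sorted by start.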