Question

I have a system that imports data (always with the same structure) via an API every day and searches it for keywords. New data appears every day, so I download it repeatedly. After searching for keywords, I save the results somewhere else. I only need to keep data imported in the last three months. I want to use Elasticsearch for full-text search because of stemming and similar features. I need some advice regarding the structure of the Elasticsearch database.

Is it better to create a new index with a timestamp in its name for every import and delete indices older than 3 months, or is it better to keep all data in one index even though I only want to search the newly imported data?

Answer 1

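You want to use a time-frame based indexing strategy. Elasticsearch lets you manage this easily with an index template, so that all of your data is added to an alias. For example, you could create a template like this: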

PUT _template/my_index_template
{
  "template": "my_index_*",
  "aliases": {"my_data": {}}
}

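This means you can send a document insert request to any index that matches the pattern "my_index_*" (i.e. whose name starts with my_index_). If you dynamically include the date in the index name, your indices become time-based. For example, data for 31 August 2016 should be stored in my_index_20160831, and it can be searched through the alias listed in the definition above: an HTTP POST request to my_data/_search will return data from your time-frame indices.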
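To make the naming scheme concrete, here is a minimal Python sketch of how the daily index name could be derived from the import date. The helper name index_name_for and the INDEX_PREFIX constant are made up for illustration; only the my_index_ prefix and the YYYYMMDD suffix come from the example above.

```python
from datetime import date

# Must match the template pattern "my_index_*" so the alias is applied.
INDEX_PREFIX = "my_index_"


def index_name_for(day: date, prefix: str = INDEX_PREFIX) -> str:
    """Return the name of the daily index for the given import date.

    Hypothetical helper: the date is rendered as YYYYMMDD, the same
    format used in the my_index_20160831 example.
    """
    return prefix + day.strftime("%Y%m%d")


if __name__ == "__main__":
    # Documents imported on 31 August 2016 would be sent to this index.
    print(index_name_for(date(2016, 8, 31)))
```

Your import job would call something like this once per run and use the result as the target index, while searches keep going through the my_data alias.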

You will then end up with a large number of indices, and a call to _cat/indices will start to look like this:

my_index_20160829
my_index_20160830
my_index_20160831

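Now you can use Curator to find indices that are older than a certain period of time. It is a command-line tool that lets you specify a pattern for the indices to delete. To test it, you can use the command: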

curator show indices --prefix my_index --older-than 3 --time-unit months --timestring %Y%m%d

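This shows all the indices it would delete; to actually delete them, change show to delete.

More information can be found in the documentation for the indices subcommand here. Note that this is for Curator version 3.5. The syntax changed in version 4.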
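If it helps to see what an age filter of this kind does with the index names, here is a rough Python sketch of the same idea. It is not Curator's implementation: the function name expired_indices is invented, and treating "3 months" as 90 days is my own simplifying assumption. It only parses the date out of each name and compares it with a cutoff.

```python
from datetime import date, datetime, timedelta


def expired_indices(names, today, prefix="my_index_",
                    timestring="%Y%m%d", max_age_days=90):
    """Return the index names whose embedded date is older than the cutoff.

    Illustrative only: max_age_days=90 is an assumed stand-in for
    "3 months", not how Curator itself measures months.
    """
    cutoff = today - timedelta(days=max_age_days)
    expired = []
    for name in names:
        if not name.startswith(prefix):
            continue  # not one of our time-based indices
        try:
            day = datetime.strptime(name[len(prefix):], timestring).date()
        except ValueError:
            continue  # suffix is not a date in the expected format
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


if __name__ == "__main__":
    names = ["my_index_20160529", "my_index_20160830", "my_index_20160831"]
    # Lists the indices a cleanup job would remove on 31 August 2016.
    print(expired_indices(names, today=date(2016, 8, 31)))
```

Names that do not start with the prefix, or whose suffix is not a valid date, are skipped rather than deleted, which is the safer default for a cleanup job.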

Answer 2

Hmm... I haven't actually tried this, but it seems logical:

https://www.elastic.co/blog/curator-tending-your-time-series-indices

https://www.elastic.co/guide/en/elasticsearch/client/curator/current/about.html

Thanks for this post, I'll have the solution when I need it :)

Answer 3

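I ran into the same problem once. If about 90 indices is not too many for you, I'd suggest using separate indices. Fetching the data will be faster than querying it with the second option.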

Elasticsearch structure for repetitive searching
