Question

I'm using Logstash to analyze accesses to my web server. So far it works well. I use a configuration file that produces data like this:

{
  "type": "apache_access",
  "clientip": "192.243.xxx.xxx",
  "verb": "GET",
  "request": "/publications/boreal:12345?direction=rtl&language=en",
  ...
  "url_path": "/publications/boreal:12345",
  "url_params": {
    "direction": "rtl",
    "language": "en"
  },
  "object_id": "boreal:12345"
  ...
}

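This record is stored in the "logstash-2016.10.02" index (one index per day). I also created another index named "publications". This index contains the publication metadata. A JSON record looks like this: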

{
   "type": "publication",
   "id": "boreal:12345",
   "sm_title": "The title of the publication",
   "sm_type": "thesis",
   "sm_creator": [
     "Smith, John",
     "Dupont, Albert",
     "Reegan, Ronald"
   ],
   "sm_departement": [
     "UCL/CORE - Center for Operations Research and Econometrics"
   ],
   "sm_date": "2001",
   "ss_state": "A"
   ...
}

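I would like to create a query like "give me all accesses for publications of 'Smith, John'". Since my data are not all in the same index, I can't use a parent-child relationship (am I right?). I read this on a forum, but it's an old post: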

By limiting itself to parent/child type relationships elasticsearch makes life 
easier for itself: a child is always indexed in the same shard as its parent,
so has_child doesn’t have to do awkward cross shard operations.

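With Logstash, I can't put all my data in a single index named logstash. I get more than 1M accesses per month... In 1 year, I would have more than 15M records in one index... and I need to keep the web access data for at least 5 years (1M * 12 * 15 = 180M). I don't think handling a single index containing more than 18M records is a good idea (if I'm wrong, please tell me).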

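Is there a solution to my problem? I can't find any elegant one. The only one I have so far, in my Python script, is: a first query that collects the ids of all publications by 'Smith, John'; then a loop over each publication to get all the web server accesses for that particular publication. So if 'Smith, John' has 321 publications, I send 321 HTTP requests to ES, and the response time is not acceptable (more than 7 seconds; not bad when you know the number of records in ES, but not acceptable for the end user).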
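The two-step approach described above can be sketched as request bodies. This is only an illustration under assumptions: the function names are hypothetical, sm_creator is assumed to be searchable as a phrase, and object_id is assumed to be indexed as an exact (not analyzed) value. The last function is a variation, not what the script currently does: it batches the second step into a single `terms` query instead of one request per publication.

```python
from typing import Dict, List


def creator_query(creator: str, size: int = 1000) -> Dict:
    """Step 1: body for a search on the `publications` index, returning only ids."""
    return {
        "size": size,
        "_source": ["id"],
        "query": {"match_phrase": {"sm_creator": creator}},
    }


def extract_ids(response: Dict) -> List[str]:
    """Pull the publication ids out of a search response."""
    return [hit["_source"]["id"] for hit in response["hits"]["hits"]]


def access_query(publication_id: str) -> Dict:
    """Step 2 as described: one body per publication, sent to `logstash-*`."""
    return {"query": {"term": {"object_id": publication_id}}}


def batched_access_query(publication_ids: List[str]) -> Dict:
    """Variation (assumption): one body covering all ids via a `terms` query."""
    return {"query": {"terms": {"object_id": list(publication_ids)}}}
```

With the batched variation, the number of HTTP requests no longer grows with the number of publications, although the join is still done on the client side.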

Thanks for your help; sorry for my English.

Renaud

Answer 1

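One idea would be to use the elasticsearch Logstash filter to look up the matching publication while Logstash processes each access-log document.

The filter retrieves the sm_creator field from the document in the publications index that has the same object_id, and enriches the access log with whatever fields of the publication document you need. After that, you only need to query the logstash-* indices.

elasticsearch {
  hosts => ["localhost:9200"]
  index => "publications"
  query => "id:%{object_id}"
  fields => {"sm_creator" => "author"}
}

Each access-log document will then carry the publication's creators, so for "give me all accesses for publications of 'Smith, John'" you only need to query the sm_creator values (copied into the author field by the mapping above) across all Logstash indices.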
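Once the access logs carry the creators, the lookup becomes a single search. A minimal sketch, assuming the filter configuration copies sm_creator into a field named author and that the logstash-* pattern matches the daily indices; the function name and the default size are hypothetical.

```python
import json
from typing import Tuple


def author_access_search(author: str, size: int = 100) -> Tuple[str, str]:
    """Return (path, JSON body) for one search over all daily Logstash indices."""
    path = "/logstash-*/_search"
    body = {
        "size": size,
        # Assumption: `author` holds the creators copied by the Logstash filter.
        "query": {"match_phrase": {"author": author}},
    }
    return path, json.dumps(body)


if __name__ == "__main__":
    path, body = author_access_search("Smith, John")
    print("POST", path)
    print(body)
```

The body can be sent with any HTTP client; the point is that one request replaces the per-publication loop.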

