Converting an ElasticSearch index from field explosion to nested documents via Logstash

Asked: 2018-01-29 20:29:27

Tags: elasticsearch migration mapping logstash nested-documents

So we have an ancient ElasticSearch index that has succumbed to field explosion. We have redesigned the structure of the index to fix this, using nested documents. However, we are trying to figure out how to migrate the old index data into the new structure. We are currently looking at using Logstash plugins, notably the aggregate plugin, to try to accomplish this. However, all the examples we can find show how to create nested documents from database calls, not from a field-exploded index. For context, here is an example of the old index:

"assetID": 22074,
"metadata": {
  "50": {
    "analyzed": "Phase One",
    "full": "Phase One",
    "date": "0001-01-01T00:00:00"
  },
  "51": {
    "analyzed": "H 25",
    "full": "H 25",
    "date": "0001-01-01T00:00:00"
  },
  "58": {
    "analyzed": "50",
    "full": "50",
    "date": "0001-01-01T00:00:00"
  }
}

And here is what we would like the transformed data to look like in the end:

"assetID": 22074,
"metadata": [{
    "metadataId": 50,
    "ngrams": "Phase One", //This was "analyzed"
    "alphanumeric": "Phase One", //This was "full"
    "date": "0001-01-01T00:00:00"
  }, {
    "metadataId": 51,
    "ngrams": "H 25", //This was "analyzed"
    "alphanumeric": "H 25", //This was "full"
    "date": "0001-01-01T00:00:00"
  }, {
    "metadataId": 58,
    "ngrams": "50", //This was "analyzed"
    "alphanumeric": "50", //This was "full"
    "date": "0001-01-01T00:00:00"
  }
]
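
For reference, the new index maps metadata as a nested type, roughly along these lines (a sketch only: the searchasset type name is the one we use for the new index, and the concrete field types are assumptions; ngrams, for instance, would presumably sit behind an n-gram analyzer):

PUT my-new-index
{
  "mappings": {
    "searchasset": {
      "properties": {
        "assetID": { "type": "integer" },
        "metadata": {
          "type": "nested",
          "properties": {
            "metadataId": { "type": "integer" },
            "ngrams": { "type": "text" },
            "alphanumeric": { "type": "text" },
            "date": { "type": "date" }
          }
        }
      }
    }
  }
}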

As an admittedly rough example, here is what we have been able to piece together from the aggregate plugin's documentation:

input {
  elasticsearch {
    hosts => "my.old.host.name:9266"
    index => "my-old-index"
    query => '{"query": {"bool": {"must": [{"term": {"_id": "22074"}}]}}}'  
    size => 500
    scroll => "5m"
    docinfo => true
  }
}

filter {
   aggregate {
    task_id => "%{id}"

    code => "     
      map['assetID'] = event.get('assetID')
      map['metadata'] ||= []
      map['metadata'] << {
        'metadataId' => ?, # somehow parse the ID out of the exploded field name 'metadata.#.full'
        'ngrams' => event.get('metadata.#.analyzed'),
        'alphanumeric' => event.get('metadata.#.full'),
        'date' => event.get('metadata.#.date')
      }
    "
    push_previous_map_as_event => true
    timeout => 150000
    timeout_tags => ['aggregated']    
  } 

   if "aggregated" not in [tags] {
    drop {}
  }

}

output {
  elasticsearch {
    hosts => "my.new.host:9266"
    index => "my-new-index"
    document_type => "%{[@metadata][_type]}"
    document_id => "%{[@metadata][_id]}"
    action => "update"
  }

  file {
    path => "C:\apps\logstash\logstash-5.6.6\testLog.log"
  }  
}

Obviously the above example is basically just pseudocode, but it is all we have been able to gather from the documentation for Logstash and ElasticSearch, the aggregate filter plugin, and generally Googling things within an inch of their lives.

2 Answers:

Answer 0 (score: 0):

You can use the event object, massage it, and then add it to the new index. Something like the below (the Logstash code is untested, so you may find some errors; check the working Ruby code after this section):

  aggregate {
    task_id => "%{id}"

    code => "
      arr = Array.new()

      map['assetID'] = event.get('assetID')

      # Walk the exploded metadata object, building one nested doc per numeric key
      metadataObj = event.get('metadata')
      metadataObj.to_hash.each do |key, value|
        transformedMetadata = {}
        transformedMetadata['metadataId'] = key.to_i

        value.to_hash.each do |k, v|
          if k == 'analyzed' then
            transformedMetadata['ngrams'] = v
          elsif k == 'full' then
            transformedMetadata['alphanumeric'] = v
          else
            transformedMetadata['date'] = v
          end
        end

        arr.push(transformedMetadata)
      end

      map['metadata'] ||= []
      map['metadata'].concat(arr)
    "
  }

Try playing with the above based on your event inputs and you will get there. Here is a working example, using the input from your question, for you to play with: https://repl.it/repls/HarshIntelligentEagle

json_data = {"assetID": 22074,
"metadata": {
  "50": {
    "analyzed": "Phase One",
    "full": "Phase One",
    "date": "0001-01-01T00:00:00"
  },
  "51": {
    "analyzed": "H 25",
    "full": "H 25",
    "date": "0001-01-01T00:00:00"
  },
  "58": {
    "analyzed": "50",
    "full": "50",
    "date": "0001-01-01T00:00:00"
  }
}
}

# Build the nested-document array from the exploded metadata hash
arr = Array.new()
transformedObj = {}
transformedObj["assetID"] = json_data[:assetID]


json_data[:metadata].to_hash.each do |key,value|  
  transformedMetadata = {}
  transformedMetadata["metadataId"] = key  
  
  value.to_hash.each do |k , v|
  
    if k == :analyzed then
       transformedMetadata["ngrams"] = v
    elsif k == :full then
       transformedMetadata["alphanumeric"] = v
    else
       transformedMetadata["date"] = v
    end
  end
  arr.push(transformedMetadata)
end
transformedObj["metadata"] = arr

puts transformedObj
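
Running the script prints the transformed structure, which should look something like this:

{"assetID"=>22074, "metadata"=>[{"metadataId"=>50, "ngrams"=>"Phase One", "alphanumeric"=>"Phase One", "date"=>"0001-01-01T00:00:00"}, {"metadataId"=>51, "ngrams"=>"H 25", "alphanumeric"=>"H 25", "date"=>"0001-01-01T00:00:00"}, {"metadataId"=>58, "ngrams"=>"50", "alphanumeric"=>"50", "date"=>"0001-01-01T00:00:00"}]}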

Answer 1 (score: 0):

In the end, we solved it with Ruby code in the filter:

# Must use the input plugin for elasticsearch at version 4.0.2, or it cannot contact a 1.X index
input {
  elasticsearch {
    hosts => "my.old.host.name:9266"
    index => "my-old-index"
    query => '{
      "query": {
        "bool": {
          "must": [
            { "match_all": { } }
          ]
        }
      }
    }' 
    size => 500
    scroll => "5m"
    docinfo => true
  }
}

filter {
  mutate {
    remove_field => ['@version', '@timestamp']
  }
}

#metadata
filter {
  mutate {
    rename => { "[metadata]" => "[metadata_OLD]" }
  }

  ruby {
    code => "
      metadataDocs = []
      metadataFields = event.get('metadata_OLD')

      # One nested doc per exploded metadata entry; string keys keep the
      # Logstash event representation consistent
      metadataFields.each { |key, value|
        metadataDoc = {
          'metadataId' => key.to_i,
          'date' => value['date']
        }

        if !value['full'].nil?
          metadataDoc['alphanumeric'] = value['full']
        end

        if !value['analyzed'].nil?
          metadataDoc['ngrams'] = value['analyzed']
        end

        metadataDocs << metadataDoc
      }

      event.set('metadata', metadataDocs)
    "
  }

  mutate {
    remove_field => ['metadata_OLD']
  }
}

output {
  elasticsearch {
    hosts => "my.new.host:9266"
    index => "my-new-index"
    document_type => "searchasset"
    document_id => "%{assetID}"
    action => "update"
    doc_as_upsert => true
  }
  file {
    path => "F:\logstash-6.1.2\logs\esMigration.log"
  }  
}
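
For the record, we run this as a one-off batch job, e.g. bin/logstash -f esMigration.conf (the config file name is assumed here); the pipeline shuts itself down once the elasticsearch input finishes scrolling through the old index. Pinning the older input plugin version mentioned in the comment at the top can be done with:

bin/logstash-plugin install --version 4.0.2 logstash-input-elasticsearch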