Logstash: importing a one-to-many relation from MySQL

Date: 2017-04-25 08:34:25

Tags: mysql elasticsearch logstash

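I am trying to import job ads from two MySQL tables (job data and locations), but I run into a problem when a job ad has more than one location. I am using this MySQL query: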

SELECT id, company, jobtitle, description, priority, DATE_FORMAT(date, '%Y-%m-%d %T') AS date, sa_locations.location AS location_name, sa_locations.lat AS location_lat, sa_locations.lon AS location_lon FROM sa_data JOIN sa_locations ON sa_data.id = sa_locations.id ORDER BY id

Apart from the location problem, everything works fine. I get results like this:

{
     "_index" : "jk",
     "_type" : "jobposting",
     "_id" : "26362",
     "_score" : 1.0,
     "_source" : {
       "date" : "2017-04-22 00:00:00",
       "location_name" : "Berlin",
       "location_lat" : "52.520007",
       "location_lon" : "13.404954",
       "@timestamp" : "2017-04-24T07:50:31.660Z",
       "@version" : "1",
       "description" : "Some text here",
       "company" : "Test Company",
       "id" : 26362,
       "jobtitle" : "Architect Data Center Network & Security",
       "priority" : 10
     }
 },  {
     "_index" : "jk",
     "_type" : "jobposting",
     "_id" : "26363",
     "_score" : 1.0,
     "_source" : {
       "date" : "2017-04-22 00:00:00",
       "location_name" : "Hamburg",
       "location_lat" : "53.551085",
       "location_lon" : "9.993682",
       "@timestamp" : "2017-04-24T07:50:31.660Z",
       "@version" : "1",
       "description" : "Some text here",
       "company" : "Test Company",
       "id" : 26363,
       "jobtitle" : "Architect Data Center Network & Security",
       "priority" : 10
     }
 }

What I want is something like this:

 {
     "_index" : "jk",
     "_type" : "jobposting",
     "_id" : "26362",
     "_score" : 1.0,
     "_source" : {
       "date" : "2017-04-22 00:00:00",
       "locations" : [ { "name" : "Berlin", "lat" : "52.520007", "lon" : "13.404954" }, { "name" : "Hamburg", "lat" : "53.551085", "lon" : "9.993682" } ],
       "@timestamp" : "2017-04-24T07:50:31.660Z",
       "@version" : "1",
       "description" : "Some text here",
       "company" : "Test Company",
       "id" : 26362,
       "jobtitle" : "Architect Data Center Network & Security",
       "priority" : 10
     }
 }

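That way, if I search for jobs near Berlin or Hamburg with a geo_distance filter, this job should show up. Is there a way to import the data in this shape with Logstash?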
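To make the target shape concrete, here is a minimal sketch of the reshaping in plain Ruby, outside Logstash. It only illustrates the data transformation being asked for. `nest_locations` is a hypothetical helper name, and the two sample rows are assumed to share one id, which is what a job with two locations would return from the JOIN.

```ruby
# Hypothetical helper: fold all rows that share an id into one document
# with a nested 'locations' array. Not part of Logstash.
def nest_locations(rows)
  location_keys = %w[location_name location_lat location_lon]

  rows.group_by { |row| row['id'] }.map do |_id, group|
    # Keep the job fields from the first row, minus the flat location columns.
    doc = group.first.reject { |key, _value| location_keys.include?(key) }
    doc['locations'] = group.map do |row|
      { 'name' => row['location_name'],
        'lat'  => row['location_lat'],
        'lon'  => row['location_lon'] }
    end
    doc
  end
end

# Sample rows: one row per (job, location) pair, assumed to share one id.
rows = [
  { 'id' => 26362, 'company' => 'Test Company',
    'location_name' => 'Berlin', 'location_lat' => '52.520007', 'location_lon' => '13.404954' },
  { 'id' => 26362, 'company' => 'Test Company',
    'location_name' => 'Hamburg', 'location_lat' => '53.551085', 'location_lon' => '9.993682' }
]

docs = nest_locations(rows)
puts docs.inspect
```

The sketch groups the whole result set at once. Logstash instead sees one row at a time, so the same reshaping has to be done incrementally there.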

My logstash.conf looks like this:

input {
  jdbc {
    jdbc_connection_string => "jdbc:mysql://localhost:3306/jk"
    jdbc_user => "..."
    jdbc_password => "..."
    jdbc_driver_library => "/etc/logstash/mysql-connector-java-5.1.41/mysql-connector-java-5.1.41-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    statement => "SELECT id, company, jobtitle, description, priority, DATE_FORMAT(date, '%Y-%m-%d %T') AS date, sa_locations.location AS location_name, sa_locations.lat AS location_lat, sa_locations.lon AS location_lon
      FROM sa_data JOIN sa_locations
      ON sa_data.id = sa_locations.id
      ORDER BY id"
  }
}

#filter {
#  aggregate {
#    task_id => "%{id}"
#    code => "
#      map['location_name'] = event.get('location_name')
#      map['location_lat'] = event.get('location_lat')
#      map['location_lon'] = event.get('location_lon')
#      map['locations'] ||= []
#      map['locations'] << {'location_name' => event.get('location_name')}
#      map['locations'] << {'location_lat' => event.get('location_lat')}
#      map['locations'] << {'location_lon' => event.get('location_lon')}
#      event.cancel()
#    "
#    push_previous_map_as_event => true
#    timeout => 3
#  }
#}

output {
  elasticsearch {
    index => "jk"
    document_type => "jobposting"
    document_id => "%{id}"
    hosts => ["localhost:9200"]
  }
}

The filter seems to be the wrong approach.

1 answer:

Answer 0 (score: 2)

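If a single ID has multiple locations, you still want to aggregate. Your current setup, however, does not build an array with one hash per location (that is, one hash for each row in the locations table).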

You can do it like this:

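filter {
  # Nest the per-row location columns under a single [location] hash.
  # The source column is location_lon, matching the alias in the SQL query.
  mutate {
    rename => {
      'location_name' => '[location][name]'
      'location_lat'  => '[location][lat]'
      'location_lon'  => '[location][lon]'
    }
  }

  # Collect the [location] hash of every row that shares the same id.
  # Rows must arrive ordered by id, and the aggregate filter needs a
  # single pipeline worker (-w 1) to see them in that order.
  # The job fields are copied into the map once per id, and each per-row
  # event is cancelled so that only the aggregated event is indexed.
  aggregate {
    task_id => '%{id}'
    code => "
      %w[id company jobtitle description priority date].each { |f| map[f] ||= event.get(f) }
      map['locations'] ||= []
      map['locations'] << event.get('location')
      event.cancel()
    "
    push_previous_map_as_event => true
    timeout => 3
  }
}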
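
As a rough illustration of why the `ORDER BY id` in the query matters here, the following Ruby sketch imitates the "push the previous map when the id changes" behaviour in plain Ruby. It is a simplified stand-in, not the plugin's actual implementation. `aggregate_rows` is a hypothetical name, and the ids and the third city are made-up sample data.

```ruby
# Hypothetical stand-in for the aggregate filter with
# push_previous_map_as_event: one map is kept for the current id and
# emitted as soon as a row with a different id arrives.
def aggregate_rows(rows)
  emitted = []
  current_id = nil
  map = nil

  rows.each do |row|
    if row['id'] != current_id
      emitted << map if map # push the previous map as an event
      current_id = row['id']
      map = { 'id' => row['id'], 'locations' => [] }
    end
    map['locations'] << row['location']
  end

  emitted << map if map # the last map; Logstash flushes it after the timeout
  emitted
end

# Made-up sample rows, already renamed into a nested 'location' hash.
ordered = [
  { 'id' => 1, 'location' => { 'name' => 'Berlin' } },
  { 'id' => 1, 'location' => { 'name' => 'Hamburg' } },
  { 'id' => 2, 'location' => { 'name' => 'Munich' } }
]
unordered = [ordered[0], ordered[2], ordered[1]]

sorted_events = aggregate_rows(ordered)
unsorted_events = aggregate_rows(unordered)

puts sorted_events.length   # 2: one event per id
puts unsorted_events.length # 3: id 1 is split across two events
```

With ordered input each id yields exactly one event holding all of its locations. With unordered input the same id is emitted more than once, each time with only part of its locations.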