I have an Elasticsearch index with the following structure, and the sample data looks like this:
{
  "studentId": 12345,
  "studentName": "abc",
  "age": 10,
  "tests": [
    {
      "testId": 100,
      "score": 70
    },
    {
      "testId": 101,
      "score": 60
    }
  ]
}
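The aggregate filter shown later in this question pivots flat SQL join rows into this nested shape. As a plain-Python sketch of that same grouping (the row data here is hypothetical, mirroring the join output):

```python
def group_rows(rows):
    """Group flat join rows into one nested document per student."""
    docs = {}
    for row in rows:
        doc = docs.setdefault(row["student_id"], {
            "studentId": row["student_id"],
            "studentName": row["student_name"],
            "age": row["age"],
            "tests": [],
        })
        # A student with no tests joins to NULL; skip those rows.
        if row["test_id"] is not None:
            doc["tests"].append({"testId": row["test_id"], "score": row["score"]})
    return list(docs.values())

# Hypothetical rows, as the Student/StudentTest join might return them.
rows = [
    {"student_id": 12345, "student_name": "abc", "age": 10, "test_id": 100, "score": 70},
    {"student_id": 12345, "student_name": "abc", "age": 10, "test_id": 101, "score": 60},
]
print(group_rows(rows))
```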
Then I have a logstash instance that runs a pipeline every 15 minutes. It fetches from MySQL the student records whose rows have been updated since the last run, based on an `updated_time` column defined as:

timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP

Here is the table structure:

Student:     id, name, age, updated_time
StudentTest: id, student_id, test_id, score, updated_time

Say the document shown above for student 12345 is already in Elasticsearch, and a new record is inserted into StudentTest for student 12345 with test_id 102 and score 80. When the logstash pipeline runs 15 minutes later, it picks up only this new record because of the timestamp condition, and it overwrites the document already in the Elasticsearch index (which has tests 100 and 101) with a document containing only test 102. How can I merge the array already present in the ES index with the newly inserted test_id 102 record, so that the final document ends up with all three tests (100, 101 and 102)?

And below is my logstash pipeline:
input {
  jdbc {
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost/mydb"
    jdbc_user => ""
    jdbc_password => ""
    schedule => "*/15 * * * *"
    tracking_column => "updated_time"
    tracking_column_type => "timestamp"
    statement => "SELECT s.id AS student_id, s.name AS student_name, s.age, st.test_id, st.score AS test_score
                  FROM student s
                  LEFT JOIN student_test st ON s.id = st.student_id
                  WHERE s.updated_time > TIMESTAMP(current_timestamp() - INTERVAL 15 MINUTE)
                     OR st.updated_time > TIMESTAMP(current_timestamp() - INTERVAL 15 MINUTE);"
  }
}
filter {
  aggregate {
    task_id => "%{student_id}"
    code => "
      map['studentId'] ||= event.get('student_id')
      map['studentName'] ||= event.get('student_name')
      map['age'] ||= event.get('age')
      map['tests'] ||= []
      if event.get('test_id') != nil
        map['tests'] << {
          'testId' => event.get('test_id'),
          'score' => event.get('test_score')
        }
      end
      event.cancel()
    "
    push_previous_map_as_event => true
    timeout => 5
  }
}
output {
  elasticsearch {
    # The aggregated event carries the map's key, studentId, not student_id.
    document_id => "%{studentId}"
    document_type => "_doc"
    index => "students"
  }
  stdout {
    codec => rubydebug
  }
}
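To make the desired merge concrete, here is a plain-Python sketch (a hypothetical helper, not part of the pipeline) of what should happen to the tests array: entries already indexed are kept, new rows are added, and a new score wins if the same testId appears twice.

```python
def merge_tests(existing_tests, new_tests):
    """Merge two tests arrays, letting new entries win on a duplicate testId."""
    by_id = {t["testId"]: t for t in existing_tests}
    for t in new_tests:
        by_id[t["testId"]] = t
    return sorted(by_id.values(), key=lambda t: t["testId"])

existing = [{"testId": 100, "score": 70}, {"testId": 101, "score": 60}]
new = [{"testId": 102, "score": 80}]
print(merge_tests(existing, new))  # tests 100, 101 and 102
```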