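We have a simple index named employees with only two fields, firstname and lastname. We load the employee data with a Logstash script. We do not want duplicate records stored in the index, even if the data file contains duplicates. In this case, if firstname + lastname is the same as an existing record, that record should not be added to the index.
The Logstash config is: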
input {
  file {
    path => "C:/employees.csv"
  }
}
filter {
  csv {
    columns => [
      "firstname",
      "lastname"
    ]
    separator => ","
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "employees"
  }
}
Data file (employees.csv):
john,doe
jane,doe
john,doe - this record should not be added to the index.
I went through a lot of documentation and searched extensively for a way to add conditions in the filter clause, but no luck so far.
Can anyone provide input on this?
Thanks
Answer 0 (score: 1)
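It sounds like you are looking for the Elasticsearch mapping _id field. If you set that field from a hash of each row's lastname / firstname (or something similar), you should avoid inserting duplicate data.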
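A minimal sketch of that idea, assuming the Logstash fingerprint filter plugin is installed (it is bundled with the default distribution). The target field name [@metadata][fingerprint] and the choice of SHA256 are assumptions made for illustration; any stable hash would do, and older plugin versions may also require a key option for the SHA methods.

```conf
filter {
  csv {
    columns => ["firstname", "lastname"]
    separator => ","
  }
  # Hash firstname + lastname into a single value.
  # Fields under @metadata are not sent to Elasticsearch as document fields.
  fingerprint {
    source => ["firstname", "lastname"]
    concatenate_sources => true
    method => "SHA256"
    target => "[@metadata][fingerprint]"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "employees"
    # Rows with the same names produce the same _id.
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```

With the default index action, a repeated row overwrites the existing document instead of creating a second one, so the index still holds one document per name pair. If the repeated row should be rejected rather than overwrite, setting action => "create" on the elasticsearch output should make Elasticsearch refuse a document whose _id already exists.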
If you do not specify what the _id should be, Elasticsearch autogenerates a unique ID for each row.
Edit: If lastname + firstname is unique enough for your dataset, that combination can be used directly as the _id.
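One way to read this edit is to use the name pair itself as the document ID, without hashing. A minimal sketch, assuming an underscore separator (an arbitrary choice, not from the original answer) and that the csv filter has already populated both fields:

```conf
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "employees"
    # Assumption: "lastname_firstname" is unique per employee.
    # The comparison is case-sensitive, so "Doe_John" and "doe_john" get different IDs.
    document_id => "%{lastname}_%{firstname}"
  }
}
```

For the sample file, both john,doe rows would map to the ID doe_john, so the index would end up with two documents: doe_john and doe_jane.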