Logstash - How to prevent loading duplicate records

Time: 2017-05-06 06:48:48

Tags: elasticsearch logstash

We have a simple index named employees with only two fields, firstname and lastname. We load the employee data with a Logstash script. We do not want duplicate records stored in the index, even if the data file contains duplicates. In this case, if firstname + lastname match an existing record, that record should not be added to the index.

The Logstash script is:

input {
    file {
        path => "C:/employees.csv"
    }
}
filter {
    csv {
        columns => ["firstname", "lastname"]
        separator => ","
    }
}
output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "employees"
    }
}

Data file - employees.csv

john,doe
jane,doe
john,doe - this record should not be added to the index.

I went through a lot of documentation and searched extensively for a way to add conditions in the filter clause, but no luck so far.

Can anyone provide input on this?

Thanks.

1 Answer:

Answer 0 (score: 1)

It sounds like you are looking for the Elasticsearch _id field. If you set that field based on a hash of lastname/firstname (or something similar) for each row, you should avoid inserting duplicate data.

If you do not specify an _id, Elasticsearch autogenerates a unique id for each row.

Edit: If lastname + firstname is unique enough for your dataset, you could also use that combination directly as the _id.
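A minimal sketch of this approach, assuming the logstash-filter-fingerprint plugin is available (exact option behavior can vary by plugin version): hash firstname + lastname into a fingerprint and use it as the document id, so re-ingesting the same pair overwrites the existing document instead of creating a duplicate. The field names and index come from the question; the [@metadata][fingerprint] target is an arbitrary choice.

```
input {
    file {
        path => "C:/employees.csv"
    }
}
filter {
    csv {
        columns => ["firstname", "lastname"]
        separator => ","
    }
    # Combine both fields into one deterministic hash per row.
    fingerprint {
        source => ["firstname", "lastname"]
        concatenate_sources => true
        method => "SHA256"
        target => "[@metadata][fingerprint]"
    }
}
output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "employees"
        # Using the fingerprint as the document id means a duplicate
        # "john,doe" row indexes onto the same _id rather than
        # creating a second document.
        document_id => "%{[@metadata][fingerprint]}"
    }
}
```

Because the id is derived only from the two fields, the third "john,doe" line in employees.csv maps to the same _id as the first and simply rewrites that document.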