Logstash - How to prevent loading duplicate records

Time: 2017-05-06 06:48:48

Tags: elasticsearch logstash

We have a simple index named employees with only two fields, firstname and lastname. We load the employee data with a Logstash script. We do not want duplicate records stored in the index, even if the data file contains duplicates. In this case, if firstname + lastname match an existing record, that record should not be added to the index.

The Logstash script is:

input {
    file {
        path => "C:/employees.csv"
    }
}
filter {
    csv {
        columns => ["firstname", "lastname"]
        separator => ","
    }
}
output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "employees"
    }
}

Data file - employees.csv

john,doe
jane,doe
john,doe - this record should not be added to the index.

I went through a lot of documentation and searched extensively for a way to add conditions in the filter clause, but no luck so far.

Can anyone provide input on this?

Thanks.

1 Answer:

Answer 0 (score: 1)

It sounds like you are looking for the Elasticsearch _id field. If you set that field based on a hash of lastname/firstname (or something similar) for each row, you should avoid inserting duplicate data.

If you do not specify an _id, Elasticsearch autogenerates a unique id for each row.

Edit: If lastname + firstname is unique enough for your dataset, you could also use that combination directly as the _id.
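A minimal sketch of this approach, assuming the logstash-filter-fingerprint plugin is available (exact option behavior can vary by plugin version): hash firstname + lastname into a fingerprint and use it as the document id, so re-ingesting the same pair overwrites the existing document instead of creating a duplicate. The field names and index come from the question; the [@metadata][fingerprint] target is an arbitrary choice.

```
input {
    file {
        path => "C:/employees.csv"
    }
}
filter {
    csv {
        columns => ["firstname", "lastname"]
        separator => ","
    }
    # Combine both fields into one deterministic hash per row.
    fingerprint {
        source => ["firstname", "lastname"]
        concatenate_sources => true
        method => "SHA256"
        target => "[@metadata][fingerprint]"
    }
}
output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "employees"
        # Using the fingerprint as the document id means a duplicate
        # "john,doe" row indexes onto the same _id rather than
        # creating a second document.
        document_id => "%{[@metadata][fingerprint]}"
    }
}
```

Because the id is derived only from the two fields, the third "john,doe" line in employees.csv maps to the same _id as the first and simply rewrites that document.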