Question

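My requirement is as follows:

Whatever data is in Hadoop, I need to be able to search it (and vice versa).

So, for this, I am using ElasticSearch. We can use the elasticsearch-hadoop plug-in to send data from Hadoop to Elastic. Real-time search is now possible.

But my question is: isn't the data duplicated? Whatever data is in Hadoop is indexed again, identically, in Elasticsearch. Is there a way to get rid of this duplication, or is my concept wrong? I searched a lot but found no clue about this duplication issue.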

Answer 1

If you specify an immutable ID for each row in elasticsearch (e.g. customerID), every insert of data that already exists is just an update.

From the official documentation on the insert methods (cf http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/configuration.html#_operation):

index (default): new data is added while existing data (based on its id) is replaced (reindexed).
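To make the "replaced based on its id" behaviour concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the Elasticsearch or elasticsearch-hadoop API: `ToyIndex`, `load`, and the sample rows are hypothetical names made up for illustration. It assumes only what the quote above says (a write with an existing id replaces the document) plus the fact that a document written without an id gets a freshly generated one.

```python
import uuid


class ToyIndex:
    """In-memory stand-in for an index; NOT the Elasticsearch API."""

    def __init__(self):
        self.docs = {}

    def index(self, doc, id_field=None):
        # With an id field, the document id comes from the data itself,
        # so re-sending the same row overwrites the earlier copy.
        # Without one, a fresh random id is generated for every write.
        doc_id = str(doc[id_field]) if id_field else uuid.uuid4().hex
        self.docs[doc_id] = doc
        return doc_id


def load(index, rows, id_field=None):
    """Write every row to the index and return the resulting document count."""
    for row in rows:
        index.index(row, id_field)
    return len(index.docs)


# Hypothetical sample data standing in for the customers file.
rows = [{"customerid": 1, "name": "Ann"}, {"customerid": 2, "name": "Bob"}]

# Same job run twice, with ids taken from customerid.
with_id = ToyIndex()
load(with_id, rows, id_field="customerid")
count_with_id = load(with_id, rows, id_field="customerid")

# Same job run twice, with no id mapping.
without_id = ToyIndex()
load(without_id, rows)
count_without_id = load(without_id, rows)

print(count_with_id, count_without_id)  # prints: 2 4
```

Running the load twice with an id mapping leaves two documents, while running it twice without one leaves four, which is why mapping a stable field to the document id matters when a job is re-run.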

If you have a "customer" dataset in Pig, just store the data like this:

A = LOAD '/user/hadoop/customers.csv' USING PigStorage()
                    ....;

B = FOREACH A GENERATE customerid, ...;


STORE B INTO 'foo/customer' USING org.elasticsearch.hadoop.pig.EsStorage('es.nodes = localhost','es.http.timeout = 5m','es.index.auto.create = true','es.input.json = true','es.mapping.id =customerid','es.batch.write.retry.wait = 30', 'es.batch.size.entries = 500');
--,'es.mapping.parent = customer');

To run a search from Hadoop, just use the custom loader:

A = LOAD 'foo/customer' USING org.elasticsearch.hadoop.pig.EsStorage('es.query=?me*');

ElasticSearch and Hadoop data duplication problem
