In Solr

Time: 2016-05-04 09:02:51

Tags: solr solrj solr5

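While indexing DB documents with SolrJ, I found duplicate documents in Solr (5.2.1). I want to avoid duplicates and overwrite documents based on the "id" field. From my Google searches, "de-duplication" looked useful for handling duplicates, so I applied it in solrconfig.xml, but unfortunately it did not work.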

If two documents have the same id, the later one should overwrite the earlier one. For example:
   "id" = 750000, "title" = here I am
   "id" = 750000, "title" = here you are
so the final result would be "id" = 750000, "title" = here you are.
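
To make the intended behavior concrete, here is a minimal sketch of "last write wins" keyed on id, in plain Java with no Solr involved. The class `OverwriteById`, the `Doc` type and the `index` method are hypothetical names used only for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OverwriteById {

    /** Hypothetical stand-in for a document: just an id and a title. */
    static final class Doc {
        final long id;
        final String title;

        Doc(long id, String title) {
            this.id = id;
            this.title = title;
        }
    }

    /** Adds documents in order; a later document replaces an earlier one with the same id. */
    static Map<Long, Doc> index(Doc... docs) {
        Map<Long, Doc> byId = new LinkedHashMap<>();
        for (Doc d : docs) {
            byId.put(d.id, d); // put() replaces any existing entry for this key
        }
        return byId;
    }

    public static void main(String[] args) {
        Map<Long, Doc> result = index(
                new Doc(750000L, "here I am"),
                new Doc(750000L, "here you are"));
        System.out.println(result.size() + " document(s), title = "
                + result.get(750000L).title);
    }
}
```

Only one entry remains for id 750000, and it carries the title of the document that was added last.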

    <!-- part of my schema.xml -->

    <field name="id" type="long" indexed="true" stored="true" required="true"/>
    <field name="title" type="string" indexed="true" stored="true" required="true"/>
    <field name="unique_id" type="string" multiValued="false" indexed="true" required="false" stored="true"/>
    <uniqueKey>unique_id</uniqueKey>

    <!-- solrconfig.xml -->

    <updateRequestProcessorChain name="dedupe">
      <processor class="solr.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <str name="signatureField">id</str>
        <bool name="overwriteDupes">true</bool>
        <str name="fields">id</str>
        <str name="signatureClass">solr.processor.TextProfileSignature</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

I would appreciate your advice.

Below is the core part of my indexing program with SolrJ (edited on 2015.05.08):

    SolrClient solr = new HttpSolrClient(urlArray[i]); // localhost:8983/solr/#/core_name[i]
    SolrInputDocument doc = new SolrInputDocument();
    UpdateResponse response;
    String[] array;

    // list.get(i) holds my DB values such as id, title, description...
    for (Map.Entry<String, Object> entry : list.get(i).entrySet()) {
        array = String.valueOf(entry.getValue()).split(","); // split the DB values on ","

        doc.addField("id", entry.getKey()); // unique id
        doc.addField("title", array[1]);
        doc.addField("link", array[2]);
        doc.addField("description", array[3]);
        response = solr.add(doc);

        doc.clear(); // reuse the same document object for the next row
    }

    solr.commit();
    solr.close();

1 answer:

Answer 0 (score: 0)

Make sure you change your update handler (the one you use from SolrJ) to use the chain you defined (in your case, "dedupe"):

    <requestHandler name="/update" class="solr.UpdateRequestHandler" >
      <lst name="defaults">
        <str name="update.chain">dedupe</str>
      </lst>
    ...
    </requestHandler>

Take a look at this URL: https://cwiki.apache.org/confluence/display/solr/De-Duplication
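
For intuition about what a signature-based chain does once it is active, here is a rough sketch in plain Java: hash the configured fields, write the hash into the signature field, and let documents with an equal signature replace each other. This is not Solr's implementation. MD5 stands in for whatever `signatureClass` is configured, and the class `SignatureSketch`, its methods and the field name `signature` are made up for the illustration.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SignatureSketch {

    /** Hashes the chosen field values into a 32-character hex string (MD5, for illustration only). */
    static String signature(Map<String, String> doc, List<String> fields) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String field : fields) {
                String value = doc.get(field);
                if (value != null) {
                    md5.update(value.getBytes(StandardCharsets.UTF_8));
                }
                md5.update((byte) 0); // separator between field values
            }
            return String.format("%032x", new BigInteger(1, md5.digest()));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every Java platform", e);
        }
    }

    /** Stores each document under its signature; an equal signature replaces the earlier document. */
    static Map<String, Map<String, String>> index(
            List<Map<String, String>> docs, List<String> fields, String signatureField) {
        Map<String, Map<String, String>> bySignature = new LinkedHashMap<>();
        for (Map<String, String> doc : docs) {
            String sig = signature(doc, fields);
            Map<String, String> copy = new LinkedHashMap<>(doc);
            copy.put(signatureField, sig); // the signature is written into its own field
            bySignature.put(sig, copy);
        }
        return bySignature;
    }

    /** Hypothetical helper that builds a two-field document. */
    static Map<String, String> doc(String id, String title) {
        Map<String, String> d = new LinkedHashMap<>();
        d.put("id", id);
        d.put("title", title);
        return d;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = Arrays.asList(
                doc("750000", "here I am"),
                doc("750000", "here you are"));
        Map<String, Map<String, String>> stored =
                index(docs, Arrays.asList("id"), "signature");
        for (Map<String, String> d : stored.values()) {
            System.out.println(d.get("title"));
        }
    }
}
```

Because the signature here is computed from `id` alone, both sample documents get the same signature and only the second one is kept.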