Question

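I'm importing data from MySQL through the DataImportHandler. This works fine, and I get this message:

Indexing completed. Added/Updated: 2,172 documents. Deleted 0 documents. (Duration: 01s) Requests: 1 (1/s), Fetched: 2,172 (2,172/s), Skipped: 0, Processed: 2,172 (2,172/s)

But when I look at my Overview, it says:

Num Docs: 1470 Max Doc: 2172 Deleted Docs: 702

So 702 documents are being deleted for a reason I can't figure out. In my schema I don't use any unique fields or anything else that could cause trouble with duplicates.

data-config.xml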

<dataConfig>
  <dataSource type="JdbcDataSource"
    driver="com.mysql.jdbc.Driver"
    url="xxx"
    user="xxx"
    password="xxx"
  />
  <document>
    <entity name="product" query="CALL getSolrProducts();" transformer="RegexTransformer">
      <field column="uuid" name="uuid"/>
      <field column="id" name="id"/>
      <field column="productimage" name="productimage"/>
      <field column="producturl" name="producturl"/>
      <field column="productpricenew" name="productpricenew"/>
      <field column="productpriceold" name="productpriceold"/>
      <field column="brandid" name="productbrand"/>
      <field column="productbrandname" name="productbrandname"/>
      <field column="productbrandurl" name="productbrandurl"/>
      <field column="productbrandimage" name="productbrandimage"/>
      <field column="productbranddata" name="productbranddata"/>
      <field column="productshippingcoast" name="productshippingcoast"/>
      <field column="productlink" name="productlink"/>
      <field column="color" name="color" splitBy=","/>
      <field column="colordata" name="colordata" splitBy=","/>
      <field column="productdescription" name="productdescription"/>
      <field column="upc" name="upc" splitBy=","/>
      <field column="productname" name="productname"/>
      <field column="productshop" name="productshop"/>
      <field column="productshopname" name="productshopname"/>
      <field column="productshopimage" name="productshopimage"/>
      <field column="productimagethumb" name="productimagethumb"/>
      <field column="productshopdata" name="productshopdata"/>
      <field column="cat1id" name="cat1id"/>
      <field column="cat2id" name="cat2id"/>
      <field column="cat3id" name="cat3id"/>
      <field column="cat4id" name="cat4id"/>
      <field column="cat1data" name="cat1data"/>
      <field column="cat2data" name="cat2data"/>
      <field column="cat3data" name="cat3data"/>
      <field column="cat4data" name="cat4data"/>
      <field column="size" name="size" splitBy=","/>
      <field column="sizedata" name="sizedata" splitBy=","/>
      <field column="recommendations" name="recommendations" splitBy=","/>
    </entity>
  </document>
</dataConfig>

Any pointers?

Answer 1

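Since you checked clean, DIH first issues a "delete all" update query and then begins posting the new documents. When indexing is complete, DIH issues a commit, which keeps only the newly posted documents and deletes all the old documents that were in the index before indexing started. Your database must have been updated, so you now get more documents, and the 702 deleted documents correspond to the documents that were in the index before indexing started. (Checking optimize in DIH will clear out the deleted documents, but optimizing can be expensive on large indexes, and deleted documents don't show up in search results anyway, so there may be little benefit.)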
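To make the counters and the clean/optimize options a bit more concrete, here is a minimal Python sketch. It assumes Solr is reachable at http://localhost:8983/solr, that the core is called products, and that the handler is registered at /dataimport; none of these appear in the question, so adjust them to your setup. `command`, `clean`, `commit` and `optimize` are the usual DIH request parameters. The script only builds the request URL and does not send anything.

```python
from urllib.parse import urlencode

# Hypothetical values: the question does not give the Solr host or core name.
SOLR_BASE = "http://localhost:8983/solr"
CORE = "products"


def deleted_docs(max_doc: int, num_docs: int) -> int:
    """Deleted Docs on the Overview page is Max Doc minus Num Docs."""
    return max_doc - num_docs


def full_import_url(clean: bool = True, commit: bool = True,
                    optimize: bool = False) -> str:
    """Build a DIH full-import URL (assumes the handler lives at /dataimport)."""
    params = {
        "command": "full-import",
        "clean": str(clean).lower(),        # delete existing docs before importing
        "commit": str(commit).lower(),      # commit once the import has finished
        "optimize": str(optimize).lower(),  # merge segments, purging deleted docs
    }
    return f"{SOLR_BASE}/{CORE}/dataimport?{urlencode(params)}"


if __name__ == "__main__":
    # Numbers taken from the Overview page in the question.
    print(deleted_docs(max_doc=2172, num_docs=1470))
    print(full_import_url(optimize=True))
```

With the numbers from the question, `deleted_docs` gives 702, matching the Overview page. Requesting the URL with `optimize=true` should bring Deleted Docs back to 0, at the cost mentioned above for large indexes.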

Why does Solr delete documents after import
