I am trying to extract all hyperlinks from an HTML page and add them to Solr as documents.
This is my DIH configuration XML:
<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
<dataSource type="FileDataSource" name="fds" />
<dataSource type="FieldReaderDataSource" name="frds" />
<document>
<entity name="lines" processor="LineEntityProcessor"
acceptLineRegex="&lt;a\s+(?:[^>]*?\s+)?href=([&quot;'])(.*?)\1"
url="/Users/naveen/AppsAndData/data/test-data/testdata.html"
dataSource="fds" transformer="RegexTransformer">
<field column="line" />
</entity>
</document>
</dataConfig>
Contents of the managed-schema XML file:
<schema name="example-data-driven-schema" version="1.6">
<uniqueKey>id</uniqueKey>
<!-- ... -->
<field name="id" type="string" indexed="true" required="true" stored="true"/>
<field name="line" type="text_general" indexed="true" stored="true"/>
</schema>
When I run a full import, the status shows:
Indexing completed. Added/Updated: 0 documents. Deleted 0 documents. (Duration: 01s)
Requests: 0 , Fetched: 4 4/s, Skipped: 0 , Processed: 0
Am I missing something? Any help would be appreciated.
Thanks, Naveen
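Answer 0 (score: 0)
The id field is defined with required="true", and it is also declared as the uniqueKey. That may be the problem: the entity only maps a line column, so no document receives an id. Could you turn that off and try again?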
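If removing the uniqueKey is not an option, another approach might be to let DIH populate id itself. The following is only a minimal, untested sketch. It assumes that LineEntityProcessor exposes each accepted line under the column name rawLine (as the Solr DIH documentation describes), that RegexTransformer copies a single capture group into the target column, and that each matching line contains exactly one link whose href can serve as a unique id. The id regex is an illustrative simplification, not the expression from the question.

```xml
<entity name="lines" processor="LineEntityProcessor"
        acceptLineRegex="&lt;a\s+(?:[^>]*?\s+)?href=([&quot;'])(.*?)\1"
        url="/Users/naveen/AppsAndData/data/test-data/testdata.html"
        dataSource="fds" transformer="RegexTransformer">
  <!-- Assumption: the raw text of each accepted line arrives in the column "rawLine". -->
  <!-- Use the href value (the single capture group) as the unique key. -->
  <field column="id" sourceColName="rawLine"
         regex="href=[&quot;']([^&quot;']+)[&quot;']" />
  <!-- Copy the whole line into the schema's "line" field. -->
  <field column="line" sourceColName="rawLine" regex="(.*)" />
</entity>
```

Lines with several links, or the same link repeated on different lines, would need extra handling, since documents that share an id overwrite each other.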