Question

I am new to Solr and am trying to test its features. I come from the RDBMS world and would like to know how Solr will handle my data.

I created a new core:

$ bin/solr create -c test

and successfully loaded a JSON file with:

$ bin/post -c test file.json

The first record of file.json looks like this:

{"attr":"01234"}

But Solr stores it as:

{"attr":1234}

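I started defining a Data Import Handler (DIH) by following this tutorial (Youtube video) so that my data would be stored correctly, and found that JSON cannot be processed by DIH. I am stuck on the definition of data-config.xml: the tutorial uses XPathEntityProcessor to process an XML file, but I cannot find a JSON processor or even a CSV one (I can easily obtain a CSV version of file.json, so loading CSV or JSON is the same to me). The official documentation is somewhat confusing and does not provide many useful examples. The only processors that might handle JSON and CSV documents seem to be LineEntityProcessor and PlainTextEntityProcessor (Official Documentation).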

This other link, from the Solr Wiki, states:

Goals


...

Make it possible to plug in any kind of data source (ftp, scp, etc.) and any other format of the user's choice (JSON, csv, etc.)

So I guess this really is possible, but how?

I found a similar question, posted in 2014, that nobody answered, so I am wondering whether in 2016, with a more recent version of Solr, there is a well-known solution.

So the question is: how can I import JSON and CSV documents using a specific data schema?

Update

Executing http://localhost:8983/solr/test/dihupdate?command=full-import does not trigger any error, but no documents are loaded. Here are the various XML files located in the core directory:

solrconfig.xml

...
<schemaFactory class="ClassicIndexSchemaFactory" />
...
<requestHandler name="/dihupdate" class="org.apache.solr.handler.dataimport.DataImportHandler" startup="lazy">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
...

schema.xml

...
<field name="id" type="long" indexed="true" stored="true" required="true" multiValued="false" />
<field name="attr1" type="string" indexed="true" stored="true" required="true" multiValued="true" />
<field name="date" type="date" indexed="true" stored="false" multiValued="true" />
<field name="attr2" type="string" indexed="true" stored="true"  multiValued="true" />
<field name="attr3" type="string" indexed="true" stored="true" multiValued="true" />
<field name="attr4" type="int" indexed="false" stored="true" multiValued="true" />
<uniqueKey>id</uniqueKey>
...

data-config.xml

<dataConfig>
    <dataSource type="FileDataSource" />
    <document>
        <entity name="f" processor="FileListEntityProcessor"
                fileName="test.json"
                rootEntity="false"
                dataSource="null"
                recursive="true"
                baseDir="/path/to/data/"/>
    </document>
</dataConfig>

Answer 1

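The schema is defined in schema.xml in the conf directory. This is the traditional way of setting up the expected format for documents (Defining Fields). If you are using the "managed schema" mode (the current default), you will have to switch to using the classic schema factory. You can then define the fields in schema.xml by following the example schema or any resource on the web that describes the structure of a schema.xml file (you define field types, then fields that use those field types).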

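The other option is the managed schema. This is the default in recent versions, and the schema is manipulated through an API that Solr exposes. On startup Solr reads the initial schema from schema.xml (if present), but after that you have to modify it through the API or the Admin interface. The API is described, with examples, on the Schema API page of the Solr guide.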

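Using a StrField (the field type used by "string") to store 012345 will make Solr store just the literal value 012345, without converting it to an integer. That is probably a good place to start.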
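As a rough sketch of the schema.xml approach (a field type first, then a field that uses it), a fragment for the attr field from the question might look like the following. The schema name and version, and the indexed, stored and sortMissingLast settings, are assumptions made for illustration and are not taken from the asker's core.

```xml
<!-- Sketch only: the name and version below are placeholders,
     and all other field types and fields are omitted. -->
<schema name="example" version="1.6">

  <!-- Field type backed by solr.StrField: values are stored verbatim,
       with no tokenizing and no numeric parsing. -->
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" />

  <!-- "attr" is the field from the question's sample record {"attr":"01234"}. -->
  <field name="attr" type="string" indexed="true" stored="true" />

</schema>
```

With a definition along these lines in place, and the classic schema factory enabled, re-indexing file.json should keep the leading zero, because the value is no longer guessed to be a number.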

Answer 2

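The Solr distribution includes a films example (in example/films) that shows how to index JSON while combining explicit field definitions with automatic type detection. The instructions (README.txt) include the results you will see if you forget one of the steps.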

I suggest you try it, then apply that knowledge to your own use case.

Solr: how to specify a schema during JSON and CSV import?
