Importing nested documents into Solr with DataImportHandler

Asked: 2016-04-14 14:54:00

Tags: solr dataimporthandler

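I'm working on a project whose specification requires parent-child relationships in the Solr data collection, i.e. users and the collection of languages they speak (each language consists of several data fields). My production system runs Solr 4.10, but I also have a 5.5 installation. So far I haven't gotten it working on either one, and I haven't found a complete source of documentation on how to implement it.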

The goal is to get result documents from Solr that look like this:

{
    "id": 123,
    "firstName": "John",
    "lastName": "Doe",
    "languagesSpoken": [
        {
            "id": 243,
            "abbreviation": "en",
            "name": "English"
        },
        {
            "id": 442,
            "abbreviation": "fr",
            "name": "French"
        }
    ]
}

In my schema.xml, I have defined all the fields as follows:

<field name="id" type="int" indexed="true" stored="true" required="true" multiValued="false" />
<field name="firstName" type="text_general" indexed="true" stored="true" />
<field name="lastName" type="text_general" indexed="true" stored="true" />
<field name="languagesSpoken" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="languagesSpoken_id" type="int" indexed="true" stored="true" />
<field name="languagesSpoken_abbreviation" type="text_general" indexed="true" stored="true" />
<field name="languagesSpoken_name" type="text_general" indexed="true" stored="true" />

The latest version of my db-data-config.xml looks like this:

<dataConfig>
    <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" url="jdbc:...." />
        <document name="clients">
            <entity name="client" query="SELECT * FROM clients" deltaImportQuery="SELECT * FROM clients WHERE id = ${dih.delta.id}" deltaQuery="SELECT id FROM clients WHERE updateDate > '${dih.last_index_time}'">

                <field column="id" name="id" />
                <field column="firstName" name="firstName" />
                <field column="lastName" name="lastName" />

                <entity name="languagesSpoken" child="true" query="SELECT id, abbreviation, name FROM languages WHERE clientId = ${client.id}">
                    <field name="languagesSpoken_id" column="id" />
                    <field name="languagesSpoken_abbreviation" column="abbreviation" />
                    <field name="languagesSpoken_name" column="name" />
                </entity>
            </entity>
        </document>
        ...

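On the 4.10 server, when the data comes out of Solr, I get a single flat document in which the fields of just one language sit inline with firstName and lastName, like this: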

{
    "id": 123,
    "firstName": "John",
    "lastName": "Doe",
    "languagesSpoken_id": 243,
    "languagesSpoken_abbreviation": "en",
    "languagesSpoken_name": "English"
}

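On the 5.5 server, when the data comes out, I get separate documents for the root client document and for each child language document, with nothing relating them to one another, like this: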

{
    "id": 123,
    "firstName": "John",
    "lastName": "Doe"
},
{
    "languagesSpoken_id": 243,
    "languagesSpoken_abbreviation": "en",
    "languagesSpoken_name": "English"
},
{
    "languagesSpoken_id": 442,
    "languagesSpoken_abbreviation": "fr",
    "languagesSpoken_name": "French"
}

I've spent several days trying to figure out what is going on here, to no avail. Can anyone give me a pointer as to what I'm missing?

Thanks, - Jeff

1 answer:

Answer 0 (score: 0)

Before importing into Solr, you may want to flatten the JSON object, as described here:

https://stackoverflow.com/a/19101235/929902

POST http://localhost:8983/solr/ggg_core/update?boost=1.0&commitWithin=1000&overwrite=true&wt=json HTTP/1.1
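As a rough illustration of the flattening step, here is a minimal Python sketch. The dotted key scheme (`languagesSpoken.0.id` and so on) is my own assumption rather than anything Solr requires, and the schema would still need fields (or dynamic fields) matching whatever keys you end up with.

```python
def flatten(value, prefix="", sep="."):
    """Collapse nested dicts/lists into one flat dict keyed by path.

    Assumed convention: list positions become numeric path segments.
    Empty dicts and empty lists produce no keys at all.
    """
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}{sep}{key}" if prefix else str(key)
            flat.update(flatten(child, path, sep))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            path = f"{prefix}{sep}{index}" if prefix else str(index)
            flat.update(flatten(child, path, sep))
    else:
        # Scalar leaf: store it under the accumulated path.
        flat[prefix] = value
    return flat


# The target document from the question.
doc = {
    "id": 123,
    "firstName": "John",
    "lastName": "Doe",
    "languagesSpoken": [
        {"id": 243, "abbreviation": "en", "name": "English"},
        {"id": 442, "abbreviation": "fr", "name": "French"},
    ],
}

flat_doc = flatten(doc)
```

The resulting `flat_doc` has only top-level scalar values, so it can be sent as an ordinary single-level document in the body of an update request.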

Then, once you read the documents back from Solr, you can unflatten them in a similar way.
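The reverse step might look something like the sketch below. It assumes the flat keys use dot-separated paths with numeric segments for list positions (a convention chosen for this example, not a Solr feature), and that no key is both a scalar and a prefix of another key.

```python
def unflatten(flat, sep="."):
    """Rebuild nested dicts/lists from a flat dict keyed by dotted paths."""
    root = {}
    for path, value in flat.items():
        parts = path.split(sep)
        node = root
        # Walk or create the intermediate containers.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return _dicts_to_lists(root)


def _dicts_to_lists(node):
    """Turn any dict whose keys are all digit strings into a list."""
    if not isinstance(node, dict):
        return node
    converted = {key: _dicts_to_lists(child) for key, child in node.items()}
    if converted and all(key.isdigit() for key in converted):
        return [converted[key] for key in sorted(converted, key=int)]
    return converted


# A flat document shaped like what might be read back from Solr.
flat_doc = {
    "id": 123,
    "firstName": "John",
    "lastName": "Doe",
    "languagesSpoken.0.id": 243,
    "languagesSpoken.0.abbreviation": "en",
    "languagesSpoken.0.name": "English",
    "languagesSpoken.1.id": 442,
    "languagesSpoken.1.abbreviation": "fr",
    "languagesSpoken.1.name": "French",
}

nested_doc = unflatten(flat_doc)
```

Doing the reshaping on the client side like this keeps the Solr schema flat, at the cost of not being able to query the languages as separate child documents.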