Question

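I'm using DIH and Tika to index documents in different languages.

Each language has its own folder (e.g. /de/file001.pdf). I want to extract the language from the path and then dynamically add a language-specific Solr field (e.g. text_de).

This is the solution I tried: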

<dataConfig>
  <script><![CDATA[
    function addField(row) {
      row.put('text_' + row.get('lang'), row.get('text'));
      return row;
    }
  ]]></script>
  <dataSource type="BinFileDataSource" />
    <document>
      <entity name="files" dataSource="null" rootEntity="false"
          processor="FileListEntityProcessor"
          baseDir="/tmp/documents" fileName=".*\.(doc|pdf|docx)"
          onError="skip"
          recursive="true"
          transformer="RegexTransformer" query="select * from files">

        <field column="fileAbsolutePath" name="id" />
        <field column="lang" regex=".*/(\w*)/.*" sourceColName="fileAbsolutePath"/>

        <entity name="documentImport"
            processor="TikaEntityProcessor"
            url="${files.fileAbsolutePath}"
            format="text"
            transformer="script:addField">

          <field column="date" name="date" meta="true"/>
          <field column="title" name="title" meta="true"/>
        </entity>

      </entity>
    </document>
</dataConfig>

This doesn't work because the row contains the 'text' field but not the 'lang' field.

Answer 1

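Your approach is right. The problem is that the row passed to the script is scoped to the current entity only, so it holds the columns of documentImport and not those of the parent entity files.

To access the parent row, use the context variable that the script function receives as its second argument. The Context variable is backed by a ContextImpl implementation, and on every script call the Solr ScriptTransformer passes the same Context instance as that second argument (see transformRow).

The following script reads the field value from the parent row and should solve your problem: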

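<dataConfig>
  <script><![CDATA[
    // DIH passes the current row and the Context to every script transformer.
    function addField(row, context) {
      // 'lang' is set on the parent entity ("files"), so resolve it through the parent context.
      var lang = context.getParentContext().resolve('files.lang');
      row.put('text_' + lang, row.get('text'));
      return row;
    }
  ]]></script>
  <!-- dataSource and document definitions are unchanged from the question -->
</dataConfig>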
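
If you want to check the logic outside Solr, here is a rough standalone JavaScript sketch of the same lookup. It is only an illustration: makeContext, filesRow and tikaRow are hypothetical stand-ins I made up for the Java objects that DIH supplies, and the mock resolve() just reads "entity.column" from a plain Map, which simplifies what the real variable resolver does. The sample path and text are invented too.

```javascript
// Hypothetical stand-in for the DIH Context; the real one is a Java object from Solr.
function makeContext(entityName, row, parent) {
  return {
    getParentContext: function () { return parent; },
    // Simplified: resolves "<entity>.<column>" against this entity's row only.
    resolve: function (name) {
      var prefix = entityName + '.';
      return name.indexOf(prefix) === 0 ? row.get(name.slice(prefix.length)) : null;
    }
  };
}

// Parent entity "files": lang is derived from the path with the regex from the question.
var path = '/tmp/documents/de/file001.pdf';
var filesRow = new Map();
filesRow.set('fileAbsolutePath', path);
filesRow.set('lang', /.*\/(\w*)\/.*/.exec(path)[1]);

// Child entity "documentImport": it only carries the columns Tika produced.
var tikaRow = new Map();
tikaRow.set('text', 'Beispieltext');
tikaRow.put = tikaRow.set; // java.util.Map has put(); alias it so addField runs unchanged

var filesContext = makeContext('files', filesRow, null);
var tikaContext = makeContext('documentImport', tikaRow, filesContext);

// Same function body as in the data-config script.
function addField(row, context) {
  var lang = context.getParentContext().resolve('files.lang');
  row.put('text_' + lang, row.get('text'));
  return row;
}

var result = addField(tikaRow, tikaContext);
console.log(tikaRow.get('lang'));   // undefined: the child row has no lang column
console.log(result.get('text_de')); // Beispieltext
```

The first log line reproduces the symptom from the question (row.get('lang') is empty in the child entity), and the second shows the field name being built from the parent's value instead.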
