Azure Cognitive Search indexer indexing from Blob storage does not strip the markup from XML content

Time: 2020-04-03 13:06:24

Tags: azure azure-cognitive-search azure-blob-storage indexer

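As the title says, I am trying to index XML files in blob storage with an indexer, but the indexed documents still contain the XML markup instead of being parsed and having their tags stripped.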

The following is a sample of the XML I am trying to index:

<article xmlns="http://ournamespace.com/ns/test" xmlns:xlink="http://www.w3.org/1999/xlink">
  <title>Lorem ipsum</title>
  <section>
    <para> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. (<link xlink:href="f8c79a4d-f1f1-4d8d-ab4c-d9317754465e"> Ut enim ad minim veniam</link>).  quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</para>
      </section>
  <section>
    <title>
      <emphasis role="strong"> Duis aute irure dolor in reprehenderit in voluptate velit esse</emphasis>
    </title>
    <itemizedlist mark="square">
      <listitem>
        <para>
          <link xlink:href="d1cce80e-835f-4b37-892e-54bba282f437">cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt</link> in culpa qui officia deserunt mollit anim id est laborum. </para>
      </listitem>
    </itemizedlist>
  </section>
</article>
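
To make the expected result concrete, here is a minimal Python sketch of what "parsed and stripped" would look like for a shortened version of the sample above. It uses the standard library's `xml.etree.ElementTree` and is only an illustration of the desired output; it is not how the indexer works internally, and the helper name `strip_xml_tags` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample document (same namespaces and elements).
SAMPLE = """<article xmlns="http://ournamespace.com/ns/test" xmlns:xlink="http://www.w3.org/1999/xlink">
  <title>Lorem ipsum</title>
  <section>
    <para> Lorem ipsum dolor sit amet (<link xlink:href="f8c79a4d-f1f1-4d8d-ab4c-d9317754465e"> Ut enim ad minim veniam</link>).  quis nostrud.</para>
  </section>
</article>"""


def strip_xml_tags(xml_text: str) -> str:
    """Return only the text nodes of an XML document, whitespace-normalized.

    Hypothetical helper for illustration: attributes such as xlink:href
    are dropped along with the tags.
    """
    root = ET.fromstring(xml_text)
    # itertext() walks every text node in document order.
    raw = " ".join(root.itertext())
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join(raw.split())


stripped = strip_xml_tags(SAMPLE)
print(stripped)
```

This is the kind of tag-free text I expected to find in the `content` field of the index.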

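I have made sure that the content type of the blobs is application/xml and that the indexer parsing mode is default, because I read in this question that both are required for the documents to be parsed correctly.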

This is the indexer configuration:

{
    "@odata.context": "https://<redacted>.search.windows.net/$metadata#indexers/$entity",
    "@odata.etag": "\"0x8D7D7C7CF735BF4\"",
    "name": "azureblob-indexer",
    "description": "",
    "dataSourceName": "blob-indexer",
    "skillsetName": null,
    "targetIndexName": "<redacted>-index",
    "disabled": null,
    "schedule": null,
    "parameters": {
        "batchSize": null,
        "maxFailedItems": 0,
        "maxFailedItemsPerBatch": 0,
        "base64EncodeKeys": null,
        "configuration": {
            "dataToExtract": "contentAndMetadata",
            "parsingMode": "default"
        }
    },
    "fieldMappings": [
        {
            "sourceFieldName": "metadata_storage_path",
            "targetFieldName": "metadata_storage_path",
            "mappingFunction": {
                "name": "base64Encode"
            }
        },
        {
            "sourceFieldName": "Subscriptions",
            "targetFieldName": "Subscriptions",
            "mappingFunction": {
                "name": "jsonArrayToStringCollection"
            }
        }
    ],
    "outputFieldMappings": [],
    "cache": null
}

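The indexed documents show the following metadata, which I find odd because the content type suddenly changes from one type to another. I assume this means the documents were parsed with the text parsing mode.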

{
   "metadata_storage_content_type": "application/xml",
   "metadata_content_encoding": "UTF-8",
   "metadata_content_type": "text/plain; charset=UTF-8"
}
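
For checking many indexed documents at once, the mismatch can be flagged with a small Python sketch like the one below. The field names come from the metadata above; the function name `content_type_mismatch` is hypothetical, and the check only reports the discrepancy. It does not explain or fix it.

```python
def content_type_mismatch(doc: dict) -> bool:
    """Return True when the blob's stored content type differs from the
    content type reported after extraction.

    Hypothetical helper: parameters such as "; charset=UTF-8" are ignored
    so that only the media type itself is compared.
    """
    def media_type(value: str) -> str:
        return value.split(";", 1)[0].strip().lower()

    stored = doc.get("metadata_storage_content_type", "")
    detected = doc.get("metadata_content_type", "")
    return media_type(stored) != media_type(detected)


indexed_doc = {
    "metadata_storage_content_type": "application/xml",
    "metadata_content_encoding": "UTF-8",
    "metadata_content_type": "text/plain; charset=UTF-8",
}
print(content_type_mismatch(indexed_doc))
```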

Here is the information for one of the indexed blobs:

LAST MODIFIED               4/2/2020, 3:28:13 PM
CREATION TIME               4/2/2020, 3:28:13 PM
TYPE                        Block blob
SIZE                        27.94 KiB
ACCESS TIER                 Hot (Inferred)
ACCESS TIER LAST MODIFIED   N/A
SERVER ENCRYPTED            true
ETAG                        0x8D7D709B1C7C5D5
CONTENT-TYPE                application/xml
CONTENT-MD5                 3yeFKKcSGh/6DJawrAWaWg==
LEASE STATUS                Unlocked
LEASE STATE                 Available
LEASE DURATION              -
COPY STATUS                 -
COPY COMPLETION TIME        -

Any help would be greatly appreciated, and I am of course happy to provide more information if needed. Thanks in advance!

0 answers:

No answers yet.