正如标题中所述,我正在尝试使用索引器为blob存储中的XML文件建立索引,但是被索引的文件仍然包含XML标记,而不是被解析和剥离标签。
以下XML是我要索引的XML代码的示例:
<article xmlns="http://ournamespace.com/ns/test" xmlns:xlink="http://www.w3.org/1999/xlink">
<title>Lorem ipsum</title>
<section>
<para> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. (<link xlink:href="f8c79a4d-f1f1-4d8d-ab4c-d9317754465e"> Ut enim ad minim veniam</link>). quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</para>
</section>
<section>
<title>
<emphasis role="strong"> Duis aute irure dolor in reprehenderit in voluptate velit esse</emphasis>
</title>
<itemizedlist mark="square">
<listitem>
<para>
<link xlink:href="d1cce80e-835f-4b37-892e-54bba282f437">cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt</link> in culpa qui officia deserunt mollit anim id est laborum. </para>
</listitem>
</itemizedlist>
</section>
</article>
我已经确保blob的内容类型为application/xml
,并且索引器解析模式为default
,因为我在this question中读到,这对于能够正确解析文档。
这是索引器配置:
{
"@odata.context": "https://<redacted>.search.windows.net/$metadata#indexers/$entity",
"@odata.etag": "\"0x8D7D7C7CF735BF4\"",
"name": "azureblob-indexer",
"description": "",
"dataSourceName": "blob-indexer",
"skillsetName": null,
"targetIndexName": "<redacted>-index",
"disabled": null,
"schedule": null,
"parameters": {
"batchSize": null,
"maxFailedItems": 0,
"maxFailedItemsPerBatch": 0,
"base64EncodeKeys": null,
"configuration": {
"dataToExtract": "contentAndMetadata",
"parsingMode": "default"
}
},
"fieldMappings": [
{
"sourceFieldName": "metadata_storage_path",
"targetFieldName": "metadata_storage_path",
"mappingFunction": {
"name": "base64Encode"
}
},
{
"sourceFieldName": "Subscriptions",
"targetFieldName": "Subscriptions",
"mappingFunction": {
"name": "jsonArrayToStringCollection"
}
}
],
"outputFieldMappings": [],
"cache": null
}
建立索引的文档显示以下元数据,我发现这很奇怪,因为content_type突然从一种类型更改为另一种类型。我认为这意味着该文档使用的解析模式为text
?
{
"metadata_storage_content_type": "application/xml",
"metadata_content_encoding": "UTF-8",
"metadata_content_type": "text/plain; charset=UTF-8"
}
这是有关被索引的blob之一的信息:
LAST MODIFIED 4/2/2020, 3:28:13 PM
CREATION TIME 4/2/2020, 3:28:13 PM
TYPE Block blob
SIZE 27.94 KiB
ACCESS TIER Hot (Inferred)
ACCESS TIER LAST MODIFIED N/A
SERVER ENCRYPTED true
ETAG 0x8D7D709B1C7C5D5
CONTENT-TYPE application/xml
CONTENT-MD5 3yeFKKcSGh/6DJawrAWaWg==
LEASE STATUS Unlocked
LEASE STATE Available
LEASE DURATION -
COPY STATUS -
COPY COMPLETION TIME -
任何帮助将不胜感激,我当然愿意在必要时提供更多信息。预先感谢!