In Azure Cognitive Search, can I add multiple blobs to a collection in a single record in the index?

Date: 2020-06-11 17:44:47

Tags: azure-cognitive-search

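I have a blob container in which each folder represents an item that I index in ACS. The folder name is the key of the item in the ACS index. Imagine the following container structure: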

container {
    item1 {
        blob1,
        blob2
    },
    item2 {
        blob3
    },
    item3 {
        blob4,
        blob5,
        blob6
    }
}

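I would like to run an indexer against the container, using skills such as OcrSkill, KeyPhrases, and EntityRecognition to extract insights from the blobs. I know I can use a ShaperSkill to shape the information from a single blob/document into the format I want. For example: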

List<InputFieldMappingEntry> inputMappings = new List<InputFieldMappingEntry>();
inputMappings.Add(new InputFieldMappingEntry(
    name: "content",
    source: "/document/content"));
inputMappings.Add(new InputFieldMappingEntry(
    name: "languageCode",
    source: "/document/languageCode"));
inputMappings.Add(new InputFieldMappingEntry(
    name: "keyPhrases",
    source: "/document/keyPhrases"));
inputMappings.Add(new InputFieldMappingEntry(
    name: "organizations",
    source: "/document/organizations"));
inputMappings.Add(new InputFieldMappingEntry(
    name: "name",
    source: "/document/name"));
List<OutputFieldMappingEntry> outputMappings = new List<OutputFieldMappingEntry>();
outputMappings.Add(new OutputFieldMappingEntry(
    name: "output",
    targetName: "myDoc"));
ShaperSkill shaperSkill = new ShaperSkill(
    description: "Shape to myDoc",
    context: "/document",
    name: "Doc Shaper",
    inputs: inputMappings,
    outputs: outputMappings);

For the indexer itself, I can extract the folder name from metadata_storage_path like this:

List<FieldMapping> fieldMappings = new List<FieldMapping>();
fieldMappings.Add(new FieldMapping(
    sourceFieldName: "metadata_storage_path",
    targetFieldName: "key",
    mappingFunction: FieldMappingFunction.ExtractTokenAtPosition("/", 4)));
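
To make the choice of position 4 concrete, here is a minimal local sketch of what the mapping function is expected to do with a blob path. It is not SDK code: `TokenDemo` and its `ExtractTokenAtPosition` helper are hypothetical stand-ins, the account and container names are made up, and it assumes the service counts the empty token between the two slashes of `https://` the same way `string.Split` does.

```csharp
using System;

public static class TokenDemo
{
    // Local stand-in (assumption) for the service-side mapping function:
    // split on the delimiter and return the token at the zero-based position.
    public static string ExtractTokenAtPosition(string input, string delimiter, int position)
    {
        string[] tokens = input.Split(new[] { delimiter }, StringSplitOptions.None);
        if (position < 0 || position >= tokens.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(position));
        }
        return tokens[position];
    }

    public static void Main()
    {
        // Hypothetical storage account and container names.
        string path = "https://myaccount.blob.core.windows.net/mycontainer/item1/blob1";

        // Tokens: "https:", "", "myaccount.blob.core.windows.net", "mycontainer", "item1", "blob1"
        Console.WriteLine(ExtractTokenAtPosition(path, "/", 4)); // prints "item1"
    }
}
```

Under these assumptions the folder name sits at position 4 only when the item folders are directly below the container root; a deeper folder layout would need a different position.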

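What I can't figure out (or whether it can even be done) is how to reference the /document/myDoc output field from multiple blobs so that they end up as multiple entries in a collection in the ACS index. My desired output is the following (only the relevant fields are shown):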

{
    "value": [
        {
            "key": "item1",
            "myDocs": [
                {
                    "name": "blob1",
                    "content": "<content from blob1>",
                    "languageCode": "<languageCode from blob1>",
                    "keyPhrases": "<keyPhrases from blob1>",
                    "organizations": "<organizations from blob1>"
                },
                {
                    "name": "blob2",
                    "content": "<content from blob2>",
                    "languageCode": "<languageCode from blob2>",
                    "keyPhrases": "<keyPhrases from blob2>",
                    "organizations": "<organizations from blob2>"
                }
            ]
        },
        {
            "key": "item2",
            "myDocs": [
                {
                    "name": "blob3",
                    "content": "<content from blob3>",
                    "languageCode": "<languageCode from blob3>",
                    "keyPhrases": "<keyPhrases from blob3>",
                    "organizations": "<organizations from blob3>"
                }
            ]
        },
        {
            "key": "item3",
            "myDocs": [
                {
                    "name": "blob4",
                    "content": "<content from blob4>",
                    "languageCode": "<languageCode from blob4>",
                    "keyPhrases": "<keyPhrases from blob4>",
                    "organizations": "<organizations from blob4>"
                },
                {
                    "name": "blob5",
                    "content": "<content from blob5>",
                    "languageCode": "<languageCode from blob5>",
                    "keyPhrases": "<keyPhrases from blob5>",
                    "organizations": "<organizations from blob5>"
                },
                {
                    "name": "blob6",
                    "content": "<content from blob6>",
                    "languageCode": "<languageCode from blob6>",
                    "keyPhrases": "<keyPhrases from blob6>",
                    "organizations": "<organizations from blob6>"
                }
            ]
        }
    ]
}

Does anyone have an idea of how I can do this?

1 Answer:

Answer 0: (score: 0)

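Indexers do not provide a way to aggregate across multiple documents into a single index field, because their change tracking may process a blob more than once, which would lead to nondeterministic results. The solution is to create two indexes: one for the blobs and one for the parent records. You can either use an external process that reads from the blob index and batch-updates the parent index, which should have simpler aggregation logic but requires you to manage an external trigger; or use a Custom Web API skill to update the parent index as each blob is processed. The custom skill's aggregation logic will probably be more complex, because it has to add a child blob to the parent record selectively, only if that blob is not already there. See the examples for how to set up an Azure Function and connect a skill to it.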
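
As a rough sketch of the aggregation step in the external-process option, the code below groups per-blob documents by their parent key and keeps one entry per blob name, so a blob that was processed more than once does not appear twice. The `BlobDoc`, `ParentDoc`, and `ParentAggregator` types are hypothetical and trimmed to a few fields; reading from the blob index and uploading to the parent index are deliberately left out, since they depend on the SDK version you use.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of one document read from the per-blob index.
public class BlobDoc
{
    public string ParentKey { get; set; }
    public string Name { get; set; }
    public string Content { get; set; }
}

// Hypothetical shape of one document to upload to the parent index.
public class ParentDoc
{
    public string Key { get; set; }
    public List<BlobDoc> MyDocs { get; set; }
}

public static class ParentAggregator
{
    // Groups blob documents by parent key; within each parent, keeps only the
    // last-seen document per blob name so reprocessed blobs are not duplicated.
    public static List<ParentDoc> Aggregate(IEnumerable<BlobDoc> blobDocs)
    {
        return blobDocs
            .GroupBy(d => d.ParentKey)
            .Select(g => new ParentDoc
            {
                Key = g.Key,
                MyDocs = g.GroupBy(d => d.Name)
                          .Select(sameName => sameName.Last())
                          .OrderBy(d => d.Name, StringComparer.Ordinal)
                          .ToList()
            })
            .OrderBy(p => p.Key, StringComparer.Ordinal)
            .ToList();
    }

    public static void Main()
    {
        var blobDocs = new List<BlobDoc>
        {
            new BlobDoc { ParentKey = "item1", Name = "blob1", Content = "first pass" },
            new BlobDoc { ParentKey = "item1", Name = "blob2", Content = "text" },
            new BlobDoc { ParentKey = "item2", Name = "blob3", Content = "text" },
            // The same blob seen again, as can happen with change tracking.
            new BlobDoc { ParentKey = "item1", Name = "blob1", Content = "second pass" }
        };

        foreach (ParentDoc parent in Aggregate(blobDocs))
        {
            Console.WriteLine(parent.Key + ": " + string.Join(", ", parent.MyDocs.Select(d => d.Name)));
        }
    }
}
```

Because the parent record is rebuilt from the full set of child documents each time, the result does not depend on how often or in what order the blobs were indexed. The custom-skill option would instead have to read the existing parent record and merge a single blob into it on every call.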