Question

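I want to provide full-text search in my application, so I am trying to configure Azure Search with cognitive search so that both image and non-image documents stored in Azure Blob Storage get indexed. However, when I configure Azure Search from Java code through the Azure Search REST API, the OCR capability is not applied and image documents are not indexed. I appear to be missing some configuration details in the REST-based setup.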

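Case 1: From the Azure portal, I am able to:

Configure Azure Search with cognitive skills (including the OCR skillset), an index, an indexer, and Azure Blob Storage.
Index image and non-image documents such as pdf, png, jpg, xls, etc.
Search the indexed documents.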

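Case 2: From Java code using the Azure Search REST API, I am able to:

Configure Azure Search with cognitive skills, an index, an indexer, and Azure Blob Storage.
Index non-image documents such as pdf, xls, etc.
Search the indexed documents.

However, in Case 2 the OCR capability is not applied and image documents are not indexed, so some configuration detail must be missing from the REST-based setup.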

I am calling the following Azure Search REST API endpoints from Java code:

1. https://%s.search.windows.net/datasources?api-version=%s
2. https://%s.search.windows.net/skillsets/cog-search-demo-ss?api-version=%s
3. https://%s.search.windows.net/indexes/%s?api-version=%s
4. https://%s.search.windows.net/indexers?api-version=%s
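For readers following along, here is a minimal sketch of how such endpoint templates might be filled in and turned into requests with the JDK's built-in `java.net.http` client (Java 11+). The class name `SearchRestRequests`, the method names, the placeholder service name, and the API version `2019-05-06` are assumptions for illustration and are not taken from the question; the `api-key` header is the one the Azure Search REST API uses for authentication.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: names are illustrative, not part of any Azure SDK.
final class SearchRestRequests {
    // Assumed API version; substitute the version your service supports.
    static final String API_VERSION = "2019-05-06";

    /** Fills in https://{service}.search.windows.net/{path}?api-version={version}. */
    static String endpoint(String service, String path, String apiVersion) {
        return String.format("https://%s.search.windows.net/%s?api-version=%s",
                service, path, apiVersion);
    }

    /**
     * Builds a request carrying a JSON body. Use "POST" for collection URLs
     * (datasources, indexers) and "PUT" for named resources
     * (skillsets/{name}, indexes/{name}).
     */
    static HttpRequest jsonRequest(String method, String url, String adminKey, String jsonBody) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .header("api-key", adminKey)
                .method(method, HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) {
        String service = "my-search-service"; // placeholder service name
        // Prints the four URLs only; no network call is made here.
        System.out.println(endpoint(service, "datasources", API_VERSION));
        System.out.println(endpoint(service, "skillsets/cog-search-demo-ss", API_VERSION));
        System.out.println(endpoint(service, "indexes/azureblob-indexing", API_VERSION));
        System.out.println(endpoint(service, "indexers", API_VERSION));
    }
}
```

A request built this way could then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. Logging the response status and body is worthwhile, since a rejected skillset or indexer definition normally comes back with an error message in the body.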

Configuration JSON:

datasource.json

{
    "name" : "csstoragetest",
    "type" : "azureblob",
    "credentials" : { "connectionString" : "connectionString" },
    "container" : { "name" : "csblob" }
}

skillset.json

{
  "description": "Extract text from images and merge with content text to produce merged_text",
  "skills":
  [
    {
      "description": "Extract text (plain and structured) from image.",
      "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
      "context": "/document/normalized_images/*",
      "defaultLanguageCode": "null",
      "detectOrientation": true,
      "inputs": [
        {
          "name": "image",
          "source": "/document/normalized_images/*"
        }
      ],
      "outputs": [
        {
          "name": "text",
          "targetName": "myText"
        },
        {
          "name": "layoutText",
          "targetName": "myLayoutText"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
      "description": "Create merged_text, which includes all the textual representation of each image inserted at the right location in the content field.",
      "context": "/document",
      "insertPreTag": " ",
      "insertPostTag": " ",
      "inputs": [
        {
          "name":"text", "source": "/document/content"
        },
        {
          "name": "itemsToInsert", "source": "/document/normalized_images/*/text"
        },
        {
          "name":"offsets", "source": "/document/normalized_images/*/contentOffset"
        }
      ],
      "outputs": [
        {
          "name": "mergedText", "targetName" : "merged_text"
        }
      ]
    }
  ]
}

index.json

{
  "name": "azureblob-indexing",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true, "searchable": false },
    { "name": "content", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false }
  ]
}

indexer.json

{
  "name" : "azureblob-indexing1",
  "dataSourceName" : "csstoragetest",
  "targetIndexName" : "azureblob-indexing",
  "schedule" : { "interval" : "PT2H" },
  "skillsetName" : "cog-search-demo-ss",
  "parameters":
  {
    "maxFailedItems":-1,
    "maxFailedItemsPerBatch":-1,
    "configuration":
    {
      "dataToExtract": "contentAndMetadata",
      "imageAction":"generateNormalizedImages",
      "parsingMode": "default",
      "firstLineContainsHeaders": false,
      "delimitedTextDelimiter": ","
    }
  }
}

After configuring Azure Search from Java code, image documents should be indexed in Azure Search, and I should be able to search them by the text they contain.

Answer 1

Try setting the default language code to null, without quotes, in skillset.json:

"defaultLanguageCode": null
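If the skillset JSON is assembled or loaded as a string in Java, a small pre-flight check can catch the quoted form before it is sent. The sketch below is illustrative only: the class `SkillsetJsonCheck` and its methods are hypothetical, and it uses a plain regular expression rather than a JSON parser, so it assumes the property appears in the usual `"defaultLanguageCode": "null"` layout.

```java
import java.util.regex.Pattern;

// Hypothetical helper: not part of any Azure SDK.
final class SkillsetJsonCheck {
    // Matches "defaultLanguageCode": "null" (the string), capturing the key and colon.
    private static final Pattern QUOTED_NULL =
            Pattern.compile("(\"defaultLanguageCode\"\\s*:\\s*)\"null\"");

    /** Returns true if the JSON sets defaultLanguageCode to the string "null". */
    static boolean hasQuotedNull(String json) {
        return QUOTED_NULL.matcher(json).find();
    }

    /** Rewrites the string "null" to the JSON literal null, leaving everything else as is. */
    static String unquoteNull(String json) {
        return QUOTED_NULL.matcher(json).replaceAll("$1null");
    }
}
```

Calling `SkillsetJsonCheck.unquoteNull(skillsetJson)` before building the request body leaves a correct definition unchanged and only touches the quoted value.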

Answer 2

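I have figured out the configuration I needed. As described in the question, I had to match all the parameters between Case 1 and Case 2 and then update the configuration JSON accordingly.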
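One way to do that matching, sketched here under assumptions, is to fetch the definitions the portal created (the REST API supports GET on named resources such as `indexers/{name}` and `skillsets/{name}`) and compare them with the JSON sent from Java. The class `DefinitionCompare`, its method names, and the naive property-name comparison are hypothetical; the comparison only looks at which property names occur, not at their values or nesting, so it is a rough first pass rather than a real JSON diff.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of any Azure SDK.
final class DefinitionCompare {
    // Naive match for JSON property names: a quoted string followed by a colon.
    private static final Pattern PROPERTY_NAME = Pattern.compile("\"([^\"]+)\"\\s*:");

    /** Builds a GET request for a named resource, e.g. path = "indexers/my-portal-indexer". */
    static HttpRequest getDefinition(String service, String path, String apiVersion, String adminKey) {
        String url = String.format("https://%s.search.windows.net/%s?api-version=%s",
                service, path, apiVersion);
        return HttpRequest.newBuilder(URI.create(url))
                .header("api-key", adminKey)
                .GET()
                .build();
    }

    /** Collects every property name that occurs anywhere in the JSON text. */
    static Set<String> propertyNames(String json) {
        Set<String> names = new TreeSet<>();
        Matcher m = PROPERTY_NAME.matcher(json);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    /** Property names present in the reference (portal) JSON but absent from the candidate. */
    static Set<String> missingProperties(String referenceJson, String candidateJson) {
        Set<String> missing = propertyNames(referenceJson);
        missing.removeAll(propertyNames(candidateJson));
        return missing;
    }
}
```

Any names reported by `missingProperties` are candidates for settings that the portal wizard added and the hand-written JSON lacks. Values still need to be compared by eye or with a proper JSON library.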

How can I programmatically set up cognitive search (with OCR) in Azure Search from Java?
