How to ensure language analysis is applied to tokens generated by WordDelimiterTokenFilter

Time: 2017-05-02 02:49:17

Tags: azure-search

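This is a new case I ran into after applying the fix from "FEMMES.COM is not tokenized correctly" (How do I get french text FEMMES.COM to index as language variants of FEMMES).

Failing test case: #FEMMES2017 should be tokenized as Femmes, Femme, 2017.

My approach of using MappingCharFilter is quite possibly wrong, and it is really just a band-aid. What is the correct way to make the failing test case pass?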

Current index configuration

  "analyzers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "text_language_search_custom_analyzer",
      "tokenizer": "text_language_search_custom_analyzer_ms_tokenizer",
      "tokenFilters": [
        "lowercase",
        "text_synonym_token_filter",
        "asciifolding",
        "language_word_delim_token_filter"
      ],
      "charFilters": [
        "html_strip",
        "replace_punctuation_with_comma"
      ]
    },
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "text_exact_search_Index_custom_analyzer",
      "tokenizer": "text_exact_search_Index_custom_analyzer_tokenizer",
      "tokenFilters": [
        "lowercase",
        "asciifolding"
      ],
      "charFilters": []
    }
  ],
  "tokenizers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
      "name": "text_language_search_custom_analyzer_ms_tokenizer",
      "maxTokenLength": 300,
      "isSearchTokenizer": false,
      "language": "french"
    },
    {
      "@odata.type": "#Microsoft.Azure.Search.StandardTokenizerV2",
      "name": "text_exact_search_Index_custom_analyzer_tokenizer",
      "maxTokenLength": 300
    }
  ],
  "tokenFilters": [
    {
      "@odata.type": "#Microsoft.Azure.Search.SynonymTokenFilter",
      "name": "text_synonym_token_filter",
      "synonyms": [
        "ca => ça",
        "yeux => oeil",
        "oeufs,oeuf,Œuf,Œufs,œuf,œufs",
        "etre,ete"
      ],
      "ignoreCase": true,
      "expand": true
    },
    {
      "@odata.type": "#Microsoft.Azure.Search.WordDelimiterTokenFilter",
      "name": "language_word_delim_token_filter",
      "generateWordParts": true,
      "generateNumberParts": true,
      "catenateWords": false,
      "catenateNumbers": false,
      "catenateAll": false,
      "splitOnCaseChange": true,
      "preserveOriginal": false,
      "splitOnNumerics": true,
      "stemEnglishPossessive": true,
      "protectedWords": []
    }
  ],
  "charFilters": [
    {
      "@odata.type": "#Microsoft.Azure.Search.MappingCharFilter",
      "name": "replace_punctuation_with_comma",
      "mappings": [
        "#=>,",
        "$=>,",
        "€=>,",
        "£=>,",
        "%=>,",
        "&=>,",
        "+=>,",
        "/=>,",
        "==>,",
        "<=>,",
        ">=>,",
        "@=>,",
        "_=>,",
        "µ=>,",
        "§=>,",
        "¤=>,",
        "°=>,",
        "!=>,",
        "?=>,",
        "\"=>,",
        "'=>,",
        "`=>,",
        "~=>,",
        "^=>,",
        ".=>,",
        ":=>,",
        ";=>,",
        "(=>,",
        ")=>,",
        "[=>,",
        "]=>,",
        "{=>,",
        "}=>,",
        "*=>,",
        "-=>,"
      ]
    }
  ]

Analyze API call

{
  "analyzer": "text_language_search_custom_analyzer",
  "text": "#femmes2017"
}
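
For anyone reproducing the call, the request above could be built roughly as follows. This is a minimal sketch, not the asker's actual code: the service name and API version are read off the @odata.context in the response, while INDEX_NAME and API_KEY are hypothetical placeholders that must be replaced. The request is only constructed, not sent.

```python
import json
import urllib.request

SERVICE = "one-adscope-search-eu-prod"  # host name as it appears in the response's @odata.context
INDEX_NAME = "my-index"                 # hypothetical: the question does not name the index
API_KEY = "<admin-api-key>"             # placeholder: supply a real key before sending
API_VERSION = "2016-09-01"              # matches V2016_09_01 in the response's @odata.context


def build_analyze_request(text, analyzer):
    """Build (but do not send) a POST request for the index's analyze endpoint."""
    url = (
        f"https://{SERVICE}.search.windows.net/indexes/{INDEX_NAME}"
        f"/analyze?api-version={API_VERSION}"
    )
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "api-key": API_KEY},
    )


request = build_analyze_request("#femmes2017", "text_language_search_custom_analyzer")

# To actually send it (requires network access and a valid key):
# with urllib.request.urlopen(request) as resp:
#     tokens = json.load(resp)["tokens"]
```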

Analyze API response

{
  "@odata.context": "https://one-adscope-search-eu-prod.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult",
  "tokens": [
    {
      "token": "femmes",
      "startOffset": 1,
      "endOffset": 7,
      "position": 0
    },
    {
      "token": "2017",
      "startOffset": 7,
      "endOffset": 11,
      "position": 1
    }
  ]
}

1 answer:

Answer 0: (score: 0)

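The components of an analyzer process the input text in a fixed order: char filters -> tokenizer -> token filters. In your case, the tokenizer performs lemmatization before the WordDelimiter token filter ever sees the tokens, so femmes2017 is still a single token at the point where lemmatization runs. Unfortunately, the Microsoft stemmers and lemmatizers are not available as standalone token filters that you could apply after the WordDelimiter token filter. You will need to add another token filter that normalizes the output of the WordDelimiter token filter according to your requirements. If this is the only case you care about, you can move the SynonymTokenFilter to the end of the analyzer chain and map femmes to femme. That is obviously not a great workaround, because it is very specific to the data you are dealing with. Hopefully this information helps you arrive at a more general solution.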
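
The ordering argument can be illustrated with a toy pipeline. This is only a sketch under loud assumptions: every function below is a hypothetical stand-in written for illustration, not the Azure Search or Microsoft tokenizer implementation. In particular, the LEMMAS table merely imitates a lemmatizer that recognizes whole words, and the char filter applies only the "#=>," rule.

```python
import re

# Hypothetical lemma table standing in for the French lemmatizing tokenizer.
# It only recognizes whole words, which is the point being illustrated.
LEMMAS = {"femmes": ["femmes", "femme"]}


def char_filter(text):
    # Imitates one rule of replace_punctuation_with_comma: "#=>,"
    return text.replace("#", ",")


def tokenizer(text):
    # Split on commas and whitespace, then expand known whole words to their lemmas.
    tokens = []
    for raw in re.split(r"[,\s]+", text):
        if raw:
            tokens.extend(LEMMAS.get(raw.lower(), [raw]))
    return tokens


def lowercase(tokens):
    return [t.lower() for t in tokens]


def word_delimiter(tokens):
    # Split each token at letter/digit boundaries, similar in spirit to splitOnNumerics.
    parts = []
    for t in tokens:
        parts.extend(re.findall(r"[^\W\d_]+|\d+", t))
    return parts


def synonym_map(tokens):
    # The workaround from the answer: a synonym rule "femmes => femme" applied last.
    mapping = {"femmes": "femme"}
    return [mapping.get(t, t) for t in tokens]


def analyze(text, token_filters):
    # Fixed order: char filter -> tokenizer -> token filters.
    tokens = tokenizer(char_filter(text))
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


whole_word = analyze("femmes", [lowercase, word_delimiter])
current = analyze("#femmes2017", [lowercase, word_delimiter])
workaround = analyze("#femmes2017", [lowercase, word_delimiter, synonym_map])

print(whole_word)   # ['femmes', 'femme']
print(current)      # ['femmes', '2017']
print(workaround)   # ['femme', '2017']
```

In the toy model, the lemma lookup fires for the standalone word but not for femmes2017, because the split happens only after the tokenizer has run. In the index definition, the corresponding change would presumably be to list text_synonym_token_filter last in the analyzer's tokenFilters array and add a "femmes => femme" rule to its synonyms. The existing synonym rules would then see lowercased, ASCII-folded, already-split tokens, so they may need review.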