Creating a custom tokenizer in ElasticSearch NEST

Date: 2017-01-05 09:43:13

Tags: asp.net elasticsearch nest elasticsearch-2.0 elasticsearch-net

I have the following custom class in ES 2.5:

Title
DataSources
Content

Searching works fine, except for the middle field, which is built/indexed with a "|" delimiter.

e.g. "|4|7|8|9|10|12|14|19|20|21|22|23|29|30"

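I need to build a query that matches some search words across all fields AND matches at least one number in the DataSources field.

To summarize what I have so far: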

    QueryBase query = new SimpleQueryStringQuery
    {
        //DefaultOperator = !operatorOR ? Operator.And : Operator.Or,
        Fields = LearnAboutFields.FULLTEXT,
        Analyzer = "standard",
        Query = searchWords.ToLower()
    };
    _boolQuery.Must = new QueryContainer[] {query};

That is the search-words query.

    foreach (var datasource in dataSources)
    {
        // Add DataSources with an OR
        queryContainer |= new WildcardQuery { Field = LearnAboutFields.DATASOURCE, Value = string.Format("*{0}*", datasource) };
    }
    // Add this Boolean Clause to our outer clause with an AND
    _boolQuery.Filter = new QueryContainer[] {queryContainer};
}

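That is the data sources query. There can be multiple data sources.

It isn't working, and it still returns results with the filter query added. I think I need to do some work on the tokenizer/analyzer, but I don't know enough about ES to figure that out.

EDIT: Per Val's comment below, I tried recoding the indexer like this: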

        _elasticClientWrapper.CreateIndex(_DataSource, i => i
            .Mappings(ms => ms
                .Map<LearnAboutContent>(m => m
                    .Properties(p => p
                        .String(s => s.Name(lac => lac.DataSources)
                            .Analyzer("classic_tokenizer")
                            .SearchAnalyzer("standard")))))
            .Settings(s => s
                .Analysis(an => an.Analyzers(a => a.Custom("classic_tokenizer", ca => ca.Tokenizer("classic"))))));
        var indexResponse = _elasticClientWrapper.IndexMany(contentList);

The index builds successfully with the data, but the query still does not work.

New query for DataSources:

        foreach (var datasource in dataSources)
        {
            // Add DataSources with an OR
            queryContainer |= new TermQuery {Field = LearnAboutFields.DATASOURCE, Value = datasource};
        }
        // Add this Boolean Clause to our outer clause with an AND
        _boolQuery.Must = new QueryContainer[] {queryContainer};

JSON:

{
  "learnabout_index": {
    "aliases": {},
    "mappings": {
      "learnaboutcontent": {
        "properties": {
          "articleID": { "type": "string" },
          "content": { "type": "string" },
          "dataSources": {
            "type": "string",
            "analyzer": "classic_tokenizer",
            "search_analyzer": "standard"
          },
          "description": { "type": "string" },
          "fileName": { "type": "string" },
          "keywords": { "type": "string" },
          "linkURL": { "type": "string" },
          "title": { "type": "string" }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1483992041623",
        "analysis": {
          "analyzer": {
            "classic_tokenizer": {
              "type": "custom",
              "tokenizer": "classic"
            }
          }
        },
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "iZakEjBlRiGfNvaFn-yG-w",
        "version": { "created": "2040099" }
      }
    },
    "warmers": {}
  }
}

Query JSON request:

{
  "size": 10000,
  "query": {
    "bool": {
      "must": [
        {
          "simple_query_string": {
            "fields": [
              "_all"
            ],
            "query": "\"housing\"",
            "analyzer": "standard"
          }
        }
      ],
      "filter": [
        {
          "terms": {
            "DataSources": [
              "1"
            ]
          }
        }
      ]
    }
  }
}

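2 answers:

Answer 0 (score: 3)

One way to achieve this is to create a custom analyzer with a classic tokenizer, which will break your DataSources field into the numbers that make it up, i.e. it will tokenize the field on each | character.

So when you create your index, you need to add this custom analyzer and then use it on the DataSources field: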

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "number_analyzer": {
          "type": "custom",
          "tokenizer": "number_tokenizer"
        }
      },
      "tokenizer": {
        "number_tokenizer": {
          "type": "classic"
        }
      }
    }
  },
  "mappings": { 
    "my_type": {
      "properties": {
        "DataSources": {
          "type": "string",
          "analyzer": "number_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}

So if you index the string "|4|7|8|9|10|12|14|19|20|21|22|23|29|30", the DataSources field will effectively contain the following array of tokens: [4, 7, 8, 9, 10, 12, 14, 19, 20, 21, 22, 23, 29, 30]
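
As a rough illustration of what that token array gives you, here is a minimal Python sketch. It only splits on `|`, which is an assumption standing in for the tokenizer (it is not Lucene's classic tokenizer), and it mimics the "at least one of" semantics of a terms lookup. The function names `tokenize_pipe_delimited` and `matches_any` are hypothetical and exist only for this example.

```python
def tokenize_pipe_delimited(value):
    """Split a '|'-delimited string into its non-empty parts.

    Simplified stand-in for the analyzer: it only handles the '|' separator.
    """
    return [part.strip() for part in value.split("|") if part.strip()]


def matches_any(indexed_tokens, wanted_terms):
    """Return True if at least one wanted term equals an indexed token."""
    return not set(indexed_tokens).isdisjoint(wanted_terms)


tokens = tokenize_pipe_delimited("|4|7|8|9|10|12|14|19|20|21|22|23|29|30")
found_one = matches_any(tokens, ["1"])               # whole-token comparison
found_one_or_19 = matches_any(tokens, ["1", "19"])   # any-of semantics
```

In this sketch a lookup for "1" does not match "10" or "12", because whole tokens are compared rather than substrings, while a lookup for "1" or "19" succeeds because "19" is one of the tokens.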

Then you can get rid of the WildcardQuery and simply use a TermsQuery instead:

var terms = new TermsQuery { Field = LearnAboutFields.DATASOURCE, Terms = dataSources };
// Add this Boolean Clause to our outer clause with an AND
_boolQuery.Filter = new QueryContainer[] { terms };

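Answer 1 (score: 1)

At first glance at your code, I think one problem you may have is that any query placed in the filter clause will not be analyzed. So basically the value will not be broken down into tokens and will be compared in its entirety.

This is easy to forget, so any value that needs to be analyzed has to go in the must or should clauses.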
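
To make the distinction between a search value that is compared whole and one that is broken into tokens more concrete, here is a small Python sketch. It is a simulation under stated assumptions, not Elasticsearch code: `analyze` is a hypothetical stand-in that only splits on `|`, and `whole_value_lookup` and `analyzed_lookup` are made-up names for the two behaviors.

```python
def analyze(value):
    """Hypothetical analyzer: split on '|' and drop empty pieces."""
    return [piece for piece in value.split("|") if piece]


def whole_value_lookup(indexed_tokens, search_value):
    """Un-analyzed lookup: the search value is compared in its entirety."""
    return search_value in indexed_tokens


def analyzed_lookup(indexed_tokens, search_value):
    """Analyzed lookup: the search value is tokenized before comparing."""
    return any(token in indexed_tokens for token in analyze(search_value))


indexed = analyze("|4|7|8|9|10")
whole_multi = whole_value_lookup(indexed, "7|8")     # "7|8" is not a token
analyzed_multi = analyzed_lookup(indexed, "7|8")     # "7" and "8" are tokens
whole_single = whole_value_lookup(indexed, "7")      # exact token match
```

In this sketch, a search value that already equals a single indexed token matches either way. The two behaviors only diverge when the search value itself would be split by the analyzer.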