Can I combine a wildcard search with full-text search in Elasticsearch?

Date: 2020-03-05 02:31:36

Tags: elasticsearch wildcard elasticsearch-dsl elasticsearch-query

For example, I have some book-title data like this in Elasticsearch:
gamexxx_nightmare
gamexxx_little_guy

Then when I type:
game => it should find gamexxx_nightmare and gamexxx_little_guy
little guy => it should find gamexxx_little_guy?

My first thought is to use a wildcard so that game matches gamexxx, and then add full-text search on top of that. How do I combine the two in one DSL query?

2 Answers:

Answer 0 (score: 2)

Although Jaspreet's answer is correct, it doesn't combine the two requirements in a single query DSL, which is what the OP asked for in his question: "How do I combine them in one DSL?"

This is an enhancement of Jaspreet's solution: I don't use wildcards either, and I even avoid the n-gram analyzer, which is expensive (it increases the index size) and requires re-indexing if the requirements change.

A single search query that combines both conditions:

Index mapping

{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "standard",
                    "char_filter": [
                        "replace_underscore" -->note this
                    ]
                }
            },
            "char_filter": {
                "replace_underscore": {
                    "type": "mapping",
                    "mappings": [
                        "_ => \\u0020"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer" : "my_analyzer"
            }
        }
    }
}

Index your sample documents

{
   "title" : "gamexxx_little_guy"
}

And

{
   "title" : "gamexxx_nightmare"
}

Single search query

{
    "query": {
        "bool": {
            "must": [ --> note this
                {
                    "bool": {
                        "must": [
                            {
                                "prefix": {
                                    "title": {
                                        "value": "game"
                                    }
                                }
                            }
                        ]
                    }
                },
                {
                    "bool": {
                        "must": [
                            {
                                "match": {
                                    "title": {
                                        "query": "little guy"
                                    }
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}

Result

 {
        "_index": "so-46873023",
        "_type": "_doc",
        "_id": "2",
        "_score": 2.2814486,
        "_source": {
           "title": "gamexxx_little_guy"
        }
     }

Important points:

  1. The first part of the query is a prefix query, which matches game in both documents (this avoids an expensive regex or wildcard).
  2. The second part enables full-text search; for that I used a custom analyzer that replaces _ with a space, so you don't need expensive n-grams in the index, and a simple match query fetches the results.
  3. The query above returns results that match both conditions; if you want results that match either condition, change the top-level bool clause from must to should.
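The effect of the custom analyzer described above can be sketched in Python (a simplified approximation: the real standard tokenizer also handles punctuation and other edge cases):

```python
def my_analyzer(text):
    # char_filter "replace_underscore": map "_" to a space (\u0020)
    text = text.replace("_", " ")
    # standard tokenizer + lowercasing, simplified: split on whitespace
    return [token.lower() for token in text.split()]

# Both the prefix query ("game" is a prefix of the token "gamexxx") and
# the match query ("little guy" -> ["little", "guy"]) now hit these tokens.
print(my_analyzer("gamexxx_little_guy"))  # ['gamexxx', 'little', 'guy']
```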

Answer 1 (score: 1)

NGrams perform better than wildcards. With wildcards, all documents must be scanned to see which ones match the pattern. Ngrams break the text into small tokens; e.g. Quick Foxes will be stored as [Qui, uic, ick, Fox, oxe, xes], depending on the min_gram and max_gram sizes.
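That tokenization can be sketched in Python (a hypothetical helper mimicking an ngram tokenizer with min_gram = max_gram = 3 and token_chars [letter, digit], not the actual Lucene implementation):

```python
import re

def trigrams(text, n=3):
    # keep only runs of letters/digits (token_chars: letter, digit),
    # then emit every substring of length n from each run
    grams = []
    for run in re.findall(r"[A-Za-z0-9]+", text):
        grams.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return grams

print(trigrams("Quick Foxes"))  # ['Qui', 'uic', 'ick', 'Fox', 'oxe', 'xes']
```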


Query

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text":{
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

If you want to use only wildcards, you can search against a not_analyzed string. That will handle the spaces between words.

GET my_index/_search
{
  "query": {
    "match": {
      "text": "little guy"
    }
  }
}
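Wildcard semantics on a not_analyzed (keyword) field can be approximated with Python's fnmatch: the pattern is tested against the whole stored value, which is why every document has to be scanned (the sample titles and pattern below are illustrative):

```python
from fnmatch import fnmatch

# hypothetical stored keyword values
titles = ["gamexxx_nightmare", "gamexxx_little_guy"]

# roughly like {"wildcard": {"title": "game*little*"}} -- * matches any run of characters
matches = [t for t in titles if fnmatch(t, "game*little*")]
print(matches)  # ['gamexxx_little_guy']
```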