For example, I have some book title data in Elasticsearch like this:

gamexxx_nightmare
gamexxx_little_guy

Then when I input game => it should find both gamexxx_nightmare and gamexxx_little_guy, and when I input little guy => it should find only gamexxx_little_guy, right?

First I thought I would use a wildcard so that game matches gamexxx, and then a full-text search on top of that? How can I combine them in one query DSL?
Answer 0 (score: 2)
While Jaspreet's answer is correct, it doesn't combine the two requirements in a single query DSL, which is what the OP asked for in his question ("How to combine them in one DSL?"). This is an enhancement of Jaspreet's solution: I also don't use wildcards, and I even avoid the n-gram analyzer, which is costly (it increases the index size) and requires a reindex if the requirements change.

Below is a single search query that combines both conditions, together with the index mapping, sample documents, and result.

Index mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "replace_underscore"   --> note this
          ]
        }
      },
      "char_filter": {
        "replace_underscore": {
          "type": "mapping",
          "mappings": [
            "_ => \\u0020"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
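To sanity-check the analyzer, you can run a sample title through the _analyze API (a quick sketch; the index name so-46873023 is taken from the search result further below). The char_filter turns the underscores into spaces, so the standard tokenizer should emit gamexxx, little, and guy as separate terms:

POST so-46873023/_analyze
{
  "analyzer": "my_analyzer",
  "text": "gamexxx_little_guy"
}

Those terms are what let the prefix query hit game and the match query hit little guy below.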
Sample documents:

{
  "title": "gamexxx_little_guy"
}

and

{
  "title": "gamexxx_nightmare"
}
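For completeness, a sketch of how these could be indexed (the index name, and the _id of 2 for gamexxx_little_guy, are assumptions taken from the search result shown below):

PUT so-46873023/_doc/1
{
  "title": "gamexxx_nightmare"
}

PUT so-46873023/_doc/2
{
  "title": "gamexxx_little_guy"
}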
Search query:

{
  "query": {
    "bool": {
      "must": [   --> note this
        {
          "bool": {
            "must": [
              {
                "prefix": {
                  "title": {
                    "value": "game"
                  }
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must": [
              {
                "match": {
                  "title": {
                    "query": "little guy"
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}
Search result:

{
  "_index": "so-46873023",
  "_type": "_doc",
  "_id": "2",
  "_score": 2.2814486,
  "_source": {
    "title": "gamexxx_little_guy"
  }
}
Important points:

- The prefix query matches game in both documents (this avoids using a costly regex).
- The custom analyzer replaces _ with whitespace, so you don't need costly n-grams in the index, and a simple match query fetches the result.
- Changed should to must.
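As a side note, the inner bool wrappers around the single clauses are not strictly required; a minimal equivalent sketch of the same query would be:

{
  "query": {
    "bool": {
      "must": [
        { "prefix": { "title": { "value": "game" } } },
        { "match": { "title": { "query": "little guy" } } }
      ]
    }
  }
}

A bool/must containing a single clause behaves the same as the clause on its own, so both forms return the same hit.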
Answer 1 (score: 1)
NGrams perform better than wildcards. With wildcards, all documents have to be scanned to see which ones match the pattern. Ngrams break the text down into small tokens; e.g. Quick Foxes will be stored as [Qui, uic, ick, Fox, oxe, xes] depending on the min_gram and max_gram sizes.
Mapping:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
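To see the trigrams this mapping actually produces, you can run the analyzer through the _analyze API (a quick sketch; the expected output assumes the 3-gram settings above, where token_chars also makes the tokenizer break at the whitespace):

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Quick Foxes"
}

This should return the tokens Qui, uic, ick, Fox, oxe, xes.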
If you only want to use wildcards, then you can search on a not_analyzed string; that also handles the spaces between words.

Search query:
GET my_index/_search
{
  "query": {
    "match": {
      "text": "little guy"
    }
  }
}
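For the wildcard alternative mentioned above, here is a rough sketch; it assumes the field is also indexed as a not_analyzed keyword sub-field (e.g. text.keyword, which is not part of the mapping shown above):

GET my_index/_search
{
  "query": {
    "wildcard": {
      "text.keyword": {
        "value": "*little guy*"
      }
    }
  }
}

Note that a leading wildcard like this is exactly the full-scan pattern the ngram approach avoids.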