Elasticsearch: prioritizing one query over another

Date: 2018-08-08 11:20:01

Tags: php html elasticsearch

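I have a PHP script that runs two queries against Elasticsearch and echoes the results on a PHP/HTML page. Both queries search the same fields for the same text, but one uses the AND operator and the other uses the OR operator.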

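The results from the AND query are the ones I want to appear first. The results from the OR query should appear as well, but after the AND results. This does not seem to work with the script in its current state.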

The script:

<?php
    require_once 'vendor/autoload.php';
    use Elasticsearch\ClientBuilder;
    $client = ClientBuilder::create()->setHosts(['REDACTED:9200'])->build();
    $es = $client;

    if (isset($_GET['q'])) {
        $q = $_GET['q'];
        $query = $es->search([
            'index' => 'rss',
            'size' => '30',
            'body' => [
            'query' => [
                'simple_query_string' => [
                    'fields' => ["message","title"],
                    'query' => "$q",
                    'default_operator' => 'and',
                    'minimum_should_match' => '100%'
                ],
                'simple_query_string' => [
                    'fields' => ["message","title"],
                    'query' => "$q",
                    'default_operator' => 'or',
                    'minimum_should_match' => '80%'
                ]
            ]
            ]
        ]);
    }
    if($query['hits']['max_score'] >=1 ) {
        $results = $query['hits']['hits'];
    }

    ?>
   <!doctype html> 
    <html>
    <head>  
        <meta charset="utf-8">
        <title>Søkemotor</title>
        <link rel="stylesheet" href="css/main.css">
    </head>
    <body>
        <div class="img">
            <img src="img/DigRevLogo3.png" alt="Logo" width="200" height="50" class="img">
        </div>
        <div class="search">
            <form action="index.php" method="get" autocomplete="off" class="search_form">
                <label><input type="text" name="q" placeholder="Søk her"></label>
                <label><input type="submit" value="Søk" name="s"></label>
            </form>
        </div>

        <?php
        $noresult = "Ingen resultat på søket av $q.";
        $i = 0;
        if(isset($results)) {
            foreach($results as $r) { ?>
                <div class="result">
                    <div class="title">
                        <a href="<?php echo $r['_source']['link']; ?>"><?php echo $r['_source']['title'];?></a>
                    </div>

                    <div class="message">
                        <br>
                        <?php echo $r['_source']['message'];?>
                    </div>
                    <div class="published">
                        <br>
                        <?php echo $r['_source']['published'];?>            
                    </div>

                </div>
                <div class="noresult">
                <?php 
            }
        }
        else echo "<CENTER>$noresult</CENTER>"; ?>
                </div>
    </body>
    </html>

If my query is "Apple Orange", my results currently appear like this:

RESULT 1: Apple Apple
RESULT 2: Apple Orange
RESULT 3: Apple Apple Apple
RESULT 4: Orange

What I want to see is this:

RESULT 1: Apple Orange
RESULT 2: Apple Apple Apple
RESULT 3: Apple Apple
RESULT 4: Orange

How can I do this? I am using Elasticsearch 6.3 installed on Debian 9, with PHP 7.2. I am happy to provide any other useful information, but I am not sure what is needed.

1 answer:

Answer 0 (score: 0)

To keep things simple, let's reduce this to plain Elasticsearch queries and switch to match, which is usually the query to start with before digging deeper as needed:

DELETE fruit
PUT fruit
{
  "settings": {
    "number_of_shards": 1
  }
}
POST fruit/_doc
{
  "fruit": "Apple Apple"
}
POST fruit/_doc
{
  "fruit": "Apple Orange"
}
POST fruit/_doc
{
  "fruit": "Apple Apple Apple"
}
POST fruit/_doc
{
  "fruit": "Orange"
}
GET fruit/_search
{
  "query": {
    "match": {
      "fruit": "Apple Orange"
    }
  }
}

The result is:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1.0498221,
    "hits": [
      {
        "_index": "fruit",
        "_type": "_doc",
        "_id": "oRg6HmUBs4EUCKS4dujJ",
        "_score": 1.0498221,
        "_source": {
          "fruit": "Apple Orange"
        }
      },
      {
        "_index": "fruit",
        "_type": "_doc",
        "_id": "oxg6HmUBs4EUCKS4d-hu",
        "_score": 0.87138504,
        "_source": {
          "fruit": "Orange"
        }
      },
      {
        "_index": "fruit",
        "_type": "_doc",
        "_id": "ohg6HmUBs4EUCKS4d-ga",
        "_score": 0.5062483,
        "_source": {
          "fruit": "Apple Apple Apple"
        }
      },
      {
        "_index": "fruit",
        "_type": "_doc",
        "_id": "oBg6HmUBs4EUCKS4duh-",
        "_score": 0.49042806,
        "_source": {
          "fruit": "Apple Apple"
        }
      }
    ]
  }
}

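For general background: the score is calculated by BM25 (which is very similar to the older TF/IDF). Why do we get this particular order?

  • The first document contains both of your search terms, which makes sense.
  • Documents with more occurrences of a search term rank higher (Apple Apple Apple above Apple Apple).
  • Why does Orange rank above Apple Apple Apple? Because Orange is the rarer term overall: it occurs twice across all your documents, versus six times for Apple. (BM25's IDF actually counts matching documents, which is two for Orange versus three for Apple.)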
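The ordering above can be checked by hand. The following is a minimal sketch, assuming the BM25 form used by Lucene 7 / Elasticsearch 6.x with the default parameters k1 = 1.2 and b = 0.75, a single shard, and simple lowercase-and-split tokenization. The function `bm25TermScore` is a hypothetical helper written for this illustration, not part of any Elasticsearch API.

```php
<?php
// Hypothetical helper (not an Elasticsearch API): BM25 score of one term in one
// document, assuming the Lucene 7 / Elasticsearch 6.x form of the formula.
function bm25TermScore(int $tf, int $docLen, float $avgLen, int $docCount, int $docFreq,
                       float $k1 = 1.2, float $b = 0.75): float
{
    if ($tf === 0) {
        return 0.0; // the term does not occur in this document
    }
    // Rarer terms (lower document frequency) get a higher IDF
    $idf = log(1 + ($docCount - $docFreq + 0.5) / ($docFreq + 0.5));
    // Longer-than-average fields are penalized, shorter ones rewarded
    $lengthNorm = $k1 * (1 - $b + $b * $docLen / $avgLen);
    return $idf * ($tf * ($k1 + 1)) / ($tf + $lengthNorm);
}

$docs = ['Apple Apple', 'Apple Orange', 'Apple Apple Apple', 'Orange'];
$queryTerms = ['apple', 'orange'];

// Assumption: lowercase and split on spaces, which is enough for these values
$tokenized = array_map(function (string $d): array {
    return explode(' ', strtolower($d));
}, $docs);

$docCount = count($docs);
$avgLen = array_sum(array_map('count', $tokenized)) / $docCount;

$scores = [];
foreach ($tokenized as $i => $tokens) {
    $score = 0.0;
    foreach ($queryTerms as $term) {
        // Number of documents that contain the term at least once
        $docFreq = count(array_filter($tokenized, function (array $t) use ($term): bool {
            return in_array($term, $t, true);
        }));
        // Number of occurrences of the term in this document
        $tf = count(array_keys($tokens, $term, true));
        $score += bm25TermScore($tf, count($tokens), $avgLen, $docCount, $docFreq);
    }
    $scores[$docs[$i]] = $score;
}

arsort($scores); // highest score first, keys preserved
foreach ($scores as $doc => $score) {
    printf("%-18s %.4f\n", $doc, $score);
}
```

Under these assumptions the script ranks the documents in the same order as the response above (Apple Orange, Orange, Apple Apple Apple, Apple Apple). The single-word Orange document benefits both from the higher IDF of its term and from its shorter-than-average field length.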

If you add explain to the query, it will show you exactly how each score is calculated:

GET fruit/_search
{
  "explain": true, 
  "query": {
    "match": {
      "fruit": "Apple Orange"
    }
  }
}

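How can you change the default behavior? You can tune some of the BM25 parameters. Read the blog post series on BM25, which explains many of the concepts. Be aware, though, that this is already a fairly advanced topic.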
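Independently of BM25 tuning, one thing in the question's script is worth noting: a PHP array literal cannot hold the same key twice, so the second 'simple_query_string' entry silently overwrites the first and only the OR query is ever sent. A possible alternative, sketched here and not taken from the answer above, is to wrap both queries in a bool query's should clause and boost the AND variant. The helper name `buildPrioritizedSearch` and the boost value of 5.0 are illustrative assumptions; the index and field names are copied from the question.

```php
<?php
// Hypothetical helper: builds search parameters in which documents matching
// ALL terms get an extra, boosted score on top of the OR match.
// The boost value is an arbitrary illustration, not a recommended setting.
function buildPrioritizedSearch(string $q, float $andBoost = 5.0): array
{
    $fields = ['message', 'title'];

    return [
        'index' => 'rss',
        'size'  => 30,
        'body'  => [
            'query' => [
                'bool' => [
                    // Each clause is its own list element, so neither overwrites the other
                    'should' => [
                        [
                            'simple_query_string' => [
                                'fields'           => $fields,
                                'query'            => $q,
                                'default_operator' => 'and',
                                'boost'            => $andBoost,
                            ],
                        ],
                        [
                            'simple_query_string' => [
                                'fields'               => $fields,
                                'query'                => $q,
                                'default_operator'     => 'or',
                                'minimum_should_match' => '80%',
                            ],
                        ],
                    ],
                ],
            ],
        ],
    ];
}

$params = buildPrioritizedSearch('Apple Orange');

// Print the request body for inspection; no cluster is needed for this step
echo json_encode($params['body'], JSON_PRETTY_PRINT), "\n";

// With the client from the question, the search would then be:
// $query = $es->search($params);
```

A document that satisfies the AND clause also satisfies the OR clause, so it collects both scores and should sort ahead of documents that match only some terms. The order among the OR-only matches is still decided by BM25, so this sketch does not by itself guarantee the exact order requested in the question.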