Question

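I have an index, "document_texts", that effectively holds the plain text of converted Word or PDF documents. It is built on a Rails stack: the ActiveModel is DocumentText, and I use the elasticsearch rails gems for the model and the API. I want to be able to match similar Word or PDF documents based on the document text.

I have been able to match documents against each other using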

response = DocumentText.search \
  query: {
      filtered: {
          query: {
              more_like_this: {
                  ids: ["12345"]
              }
          }
      }
  }

But I want to see, for a given result set, which query terms were used to match the documents.

Using the elasticsearch API gem I can do the following

client = Elasticsearch::Client.new log: true

 client.indices.validate_query index: 'document_texts',
    explain: true,
    body: {
      query: {
          filtered: {
              query: {
                  more_like_this: {
                      ids: ['12345']
                  }
              }
          }
      }
   }

But I get this in response

{"valid":true,"_shards":{"total":1,"successful":1,"failed":0},"explanations":[{"index":"document_texts","valid":true,"explanation":"+(like:null -_uid:document_text#12345)"}]}

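I would like to know how the query was built. It uses up to 25 terms for matching: what are those 25 terms, and how can I get them from the query?

I'm not sure it is possible, but I would like to know whether I can get the 25 terms used by the Elasticsearch analyzer and then re-apply boost values to the terms of my choosing.

I would also like to highlight these terms in the document text, and tried this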

response = DocumentText.search \
  from: 0, size: 25,
  query: {
      filtered: {
          query: {
              more_like_this: {
                  ids: ["12345"]
              }
          },
          filter: {
              bool: {
                  must: [                            
                      {match: { documentable_type: model}}
                 ]
              }
          }

      }
  },
  highlight: {
    pre_tags: ["<tag1>"],
    post_tags: ["</tag1>"],
    fields: {
        doc_text: {
                type_name: {
                content: {term_vector: "with_positions_offsets"}
            }
        }
    }
  }

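But this did not produce anything; I think I was being rather hopeful. I know this should be possible, and I would be keen to hear whether anyone has done it or what the best approach is. Any ideas?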
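One possible direction for the boosting and highlighting part, as a rough sketch only: once a list of terms is known, a bool query of boosted term clauses can be combined with a plain highlight request on the same field. The helper name boosted_terms_search and the example terms and boosts are hypothetical; doc_text is the field name from the attempt above. Because term clauses are not analyzed, this assumes the terms are given exactly as they are stored in the index.

```ruby
# Build a search body that matches on an explicit list of boosted terms and
# asks for highlighting on the same field.
# NOTE: the helper name, terms and boost values are illustrative assumptions.
def boosted_terms_search(field, boosts, size = 25)
  # One term clause per chosen term, each carrying its own boost.
  should = boosts.map do |term, boost|
    { term: { field => { value: term, boost: boost } } }
  end

  {
    from: 0,
    size: size,
    query: { bool: { should: should, minimum_should_match: 1 } },
    highlight: {
      pre_tags: ['<tag1>'],
      post_tags: ['</tag1>'],
      # An empty hash requests default highlighting for this field.
      fields: { field => {} }
    }
  }
end

body = boosted_terms_search(:doc_text, { 'contract' => 2.0, 'invoice' => 1.0 })
# With a running cluster the body could be passed on, for example:
# response = DocumentText.search(body)
```

The highlight section here lists the field directly under fields, rather than nesting a mapping-style term_vector setting inside it, since term_vector is an index mapping option and not a highlight option.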

Answer 1

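Posting this for anyone else who lands here: the output includes some stop words, but this gives a simple way to show the terms the query uses. The key change from the call in the question is rewrite: true. It does not solve the highlighting problem, but it does give the terms used in the mlt (more_like_this) matching process. Some of the other settings are there only for display.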

client.indices.validate_query index: 'document_texts',
  rewrite: true,
  explain: true,
  body: {
    query: {
        filtered: {
            query: {
                more_like_this: {
                    ids: ['10538']
                }
            }
        }
    }
 }

https://github.com/elastic/elasticsearch-ruby/pull/359

Once that pull request is merged, this should be easier.
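To turn the rewritten explanation into a usable term list, the string can be parsed on the client side. The sketch below is an assumption-laden illustration: the helper name mlt_terms is hypothetical, and the sample explanation string is made up to resemble a Lucene-style rewritten query (field:term clauses plus a negated _uid clause). The real string returned by a given Elasticsearch version may be shaped differently.

```ruby
# Collect field => [terms] from a Lucene-style explanation string.
# NOTE: the explanation format is an assumption; adjust the patterns to the
# string your cluster actually returns.
def mlt_terms(explanation)
  terms = Hash.new { |hash, key| hash[key] = [] }

  explanation.split.each do |token|
    # Skip negated clauses such as "-_uid:document_text#10538".
    next if token.match?(/\A[(+]*-/)

    # Take the first field:term pair, stopping at brackets, boosts (^) and
    # minimum-should-match suffixes (~).
    match = token.match(/([\w.]+):([^\s()^~]+)/)
    terms[match[1]] << match[2] if match
  end

  terms
end

# Hypothetical explanation string, for illustration only.
sample = '((doc_text:contract doc_text:invoice doc_text:payment)~1) ' \
         '-_uid:document_text#10538'
p mlt_terms(sample)
```

Run against the unrewritten explanation from the question, +(like:null -_uid:document_text#12345), the same helper only finds the placeholder pair like:null, which is consistent with the terms not being visible without rewrite: true.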

