Question

I am using the Elasticsearch JDBC importer with this configuration:

bin=/usr/share/elasticsearch/elasticsearch-jdbc-2.1.1.2/bin
lib=/usr/share/elasticsearch/elasticsearch-jdbc-2.1.1.2/lib
echo '{
    "type" : "jdbc",
    "jdbc" : {
        "url" : "ip/db",
        "user" : "myuser",
        "password" : "a7sdf7hsdf8hn78df",
        "sql" : "SELECT title, body, source_id, time_order, type, blablabla...",
        "index" : "importeditems",
        "type" : "item",
        "elasticsearch.host": "_eth0_",
        "detect_json" : false
    }
}' | java \
       -cp "${lib}/*" \
       -Dlog4j.configurationFile=${bin}/log4j2.xml \
       org.xbib.tools.Runner \
       org.xbib.tools.JDBCImporter

I have successfully indexed some documents with the following format:

{
"title":"Tiempo de Opinión: Puede comenzar un ciclo",
"body":"Sebas Álvaro nos trae cada lunes historias y anécdotas de la montaña<!-- com -->",
"source_id":21188,
"time_order":"1438638043:55c2c6bb96d4c",
"type":"rss"
}

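I am trying to ignore accents (for example, "opinión" in the title has an ó), so that if a user searches for "tiempo de opinión" or "tiempo de opinion" using match_phrase, the query matches documents with or without the accent.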
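To make the goal concrete, here is a minimal Python sketch of what a lowercase plus asciifolding chain is meant to achieve. It is only an approximation: the helper names `fold_ascii` and `analyze` are hypothetical, whitespace splitting stands in for the standard tokenizer, and Unicode decomposition stands in for the real asciifolding filter.

```python
import unicodedata


def fold_ascii(text):
    # Decompose accented characters (e.g. "ó" -> "o" + combining acute accent),
    # then drop the combining marks. This only approximates asciifolding.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    # Rough stand-in for: standard tokenizer -> lowercase -> asciifolding.
    # Splitting on whitespace and stripping punctuation is a simplification.
    tokens = []
    for raw in text.split():
        token = fold_ascii(raw.lower()).strip(".,:;!?\"'")
        if token:
            tokens.append(token)
    return tokens


with_accent = analyze("Tiempo de Opinión")
without_accent = analyze("tiempo de opinion")
print(with_accent)
print(without_accent)
```

Both calls produce the same token list, which is why a phrase query should match regardless of the accent once the same chain is applied to the documents and to the query.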

So, after running the importer and having everything indexed, I changed the index settings so that the default analyzer uses the asciifolding filter.

curl -XPOST 'localhost:9200/importeditems/_close'

curl -XPUT 'localhost:9200/importeditems/_settings?pretty=true' -d '{
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer" : "standard",
          "filter":  [ "lowercase", "asciifolding"]
}}}}'

curl -XPOST 'localhost:9200/importeditems/_open'

Then I ran match_phrase queries for "tiempo de opinion" (no accent) and "tiempo de opinión" (with accent):

# No accent
curl -XGET 'localhost:9200/importeditems/_search?pretty=true' -d'
{
"query": {
            "match_phrase" : {
                 "title" : "tiempo de opinion"
}}}'

# With accent
curl -XGET 'localhost:9200/importeditems/_search?pretty=true' -d'
{
"query": {
            "match_phrase" : {
                 "title" : "tiempo de opinión"
}}}'

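But neither query returns a match, even though matching documents exist. (If I run match_phrase with the phrase "tiempo de", it returns some hits that contain "tiempo de opinión".)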

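I think the problem is caused by the JDBC importer, because I tried to reproduce the error without it: I manually created another index, added entries, changed the index settings to use asciifolding as well, and everything worked as expected. You can find this working example here.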

If I check the settings of the index created with the importer (importeditems):

curl -XGET 'localhost:9200/importeditems/_settings?pretty=true'

Output:

{
  "importeditems" : {
    "settings" : {
      "index" : {
        "creation_date" : "1457533907278",
        "analysis" : {
          "analyzer" : {
            "default" : {
              "filter" : [ "lowercase", "asciifolding" ],
              "tokenizer" : "standard"
            }
          }
        },
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "x",
        "version" : {
          "created" : "2010199"
}}}}}

...and if I check the settings of the index I created manually (test):

curl -XGET 'localhost:9200/test/_settings?pretty=true'

I get exactly the same output:

{
  "test" : {
    "settings" : {
      "index" : {
        "creation_date" : "1457603253278",
        "analysis" : {
          "analyzer" : {
            "default" : {
              "filter" : [ "lowercase", "asciifolding" ],
              "tokenizer" : "standard"
            }
          }
        },
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "x",
        "version" : {
          "created" : "2010199"
}}}}}

Can anyone tell me why this does not work when I use the Elasticsearch JDBC Importer, and why it does work when I add the raw data myself?

Answer 1

I finally solved the problem by first changing the settings to add the analysis block:

curl -XPOST 'localhost:9200/importeditems/_close'

curl -XPUT 'localhost:9200/importeditems/_settings?pretty=true' -d '{
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer" : "standard",
          "filter":  [ "lowercase", "asciifolding"]
}}}}'

curl -XPOST 'localhost:9200/importeditems/_open'

...and then importing all the data again.

This is odd because, as I said in the post, I did exactly the same thing in both cases (with the JDBC Importer and with the raw data):

Index the data
Change the index settings
Use match_phrase

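It worked with the raw data (test) but not with the data I loaded through the importer (importeditems). The only explanation I can think of is that importeditems is over 12 GB and needs time to rebuild its content with asciifolding, which would be why the change was not reflected right after asciifolding was activated. Note, though, that documents already in an index keep the terms produced by the analyzer that was in effect when they were indexed, so re-importing is what makes the new filter apply to them.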

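In any case, if anyone runs into the same problem, especially with a large amount of data, remember to set up the analyzer first and only then index all the data.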
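One way to follow this advice is to create the index with the analysis settings already in place before the importer runs, since the create index API accepts a `settings` object. The Python sketch below only builds and prints that request body. The helper name `build_create_index_body` is hypothetical, the index is assumed not to exist yet, and actually sending the request (for example with curl -XPUT against localhost:9200/importeditems) is left out.

```python
import json

INDEX_NAME = "importeditems"  # index name taken from the question


def build_create_index_body():
    # Same default analyzer as in the _settings call above, but wrapped in
    # "settings" so it can be supplied when the index is first created.
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "default": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "asciifolding"],
                    }
                }
            }
        }
    }


body = json.dumps(build_create_index_body(), indent=2)
print("PUT /" + INDEX_NAME)
print(body)
```

With the analyzer defined up front, every document the importer writes is analyzed with asciifolding from the start, so no close/open cycle or second import is needed.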

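According to the docs:

Queries can find only terms that actually exist in the inverted index, so it is important to ensure that the same analysis process is applied both to the document at index time and to the query string at search time, so that the terms in the query match the terms in the inverted index.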
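The quoted rule can be illustrated with a toy inverted index in Python. This is a simplified model and not how Elasticsearch is implemented: the helpers `analyze`, `build_index` and `match_phrase` are hypothetical, and the sketch assumes the data was first indexed with lowercasing only (no folding) while queries already use the folding analyzer.

```python
import unicodedata


def fold_ascii(text):
    # Approximate asciifolding: decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text, fold):
    # Lowercase, split on whitespace, strip punctuation, optionally fold.
    tokens = [t.strip(".,:;!?") for t in text.lower().split()]
    tokens = [t for t in tokens if t]
    return [fold_ascii(t) for t in tokens] if fold else tokens


def build_index(docs, fold):
    # Inverted index: term -> {doc_id: [positions]}
    index = {}
    for doc_id, text in docs.items():
        for pos, term in enumerate(analyze(text, fold)):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index


def match_phrase(index, phrase, fold):
    # A document matches when all query terms occur at consecutive positions.
    terms = analyze(phrase, fold)
    if not terms or any(t not in index for t in terms):
        return []
    hits = []
    for doc_id, positions in index[terms[0]].items():
        for start in positions:
            if all(start + i in index[t].get(doc_id, []) for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


docs = {1: "Tiempo de Opinión: Puede comenzar un ciclo"}
queries = ("tiempo de opinion", "tiempo de opinión")

old_index = build_index(docs, fold=False)  # indexed before the analyzer change
new_index = build_index(docs, fold=True)   # re-imported after the change

stale_hits = [match_phrase(old_index, q, fold=True) for q in queries]
fresh_hits = [match_phrase(new_index, q, fold=True) for q in queries]
partial_hits = match_phrase(old_index, "tiempo de", fold=True)
print(stale_hits, fresh_hits, partial_hits)
```

Against the stale index both full phrases return nothing because the stored term is still "opinión", while "tiempo de" still matches since those terms contain no accents. After re-indexing with folding, both spellings match.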

'asciifolding' does not work as expected after using the Elasticsearch JDBC Importer
