Cannot bulk load a Wikipedia json.gz dump into Elasticsearch

Time: 2017-11-26 05:09:32

Tags: elasticsearch

I am following the online example for importing a Wikipedia json.gz dump into Elasticsearch: https://www.elastic.co/blog/loading-wikipedia

When I run the following:

curl -s 'https://'$site'/w/api.php?action=cirrus-mapping-dump&format=json&formatversion=2' |
 jq .content |
 sed 's/"index_analyzer"/"analyzer"/' |
 sed 's/"position_offset_gap"/"position_increment_gap"/' |
curl -XPUT $es/$index/_mapping/page?pretty -d @-

I get this error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "mapper_parsing_exception",
        "reason" : "Unknown Similarity type [arrays] for field [category]"
      }
    ],
    "type" : "mapper_parsing_exception",
    "reason" : "Unknown Similarity type [arrays] for field [category]"
  },
  "status" : 400
}

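Does anyone have any ideas? I cannot ingest the Wikipedia content using the method described. I hope the company at least updates its tutorial page.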

1 answer:

Answer 0: (score: 0)

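The mapping returned by that URI is the one Wikimedia uses for its own index, and the tutorial appears to have been written for Elasticsearch 2.x, so it does not apply cleanly to a newer cluster. The error itself says that the category field asks for a similarity named arrays, which your index does not define. (Note that formatversion=2 in the URI is a MediaWiki API output-format parameter, not an Elasticsearch version.) I suggest you: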

  • Manually download a dump of the wiki's production Elasticsearch index: http://dumps.wikimedia.org/other/cirrussearch/current/

  • Create a mapping that fits your needs, replacing any features that are deprecated in Elasticsearch 5.x. For example:

    {
      "mappings": {
        "page": {
          "properties": {
            "auxiliary_text": {
              "type": "text"
            },
            "category": {
              "type": "text"
            },
            "coordinates": {
              "properties": {
                "coord": {
                  "properties": {
                    "lat": {
                      "type": "double"
                    },
                    "lon": {
                      "type": "double"
                    }
                  }
                },
                "country": {
                  "type": "text"
                },
                "dim": {
                  "type": "long"
                },
                "globe": {
                  "type": "text"
                },
                "name": {
                  "type": "text"
                },
                "primary": {
                  "type": "boolean"
                },
                "region": {
                  "type": "text"
                },
                "type": {
                  "type": "text"
                }
              }
            },
            "defaultsort": {
              "type": "boolean"
            },
            "external_link": {
              "type": "text"
            },
            "heading": {
              "type": "text"
            },
            "incoming_links": {
              "type": "long"
            },
            "language": {
              "type": "text"
            },
            "namespace": {
              "type": "long"
            },
            "namespace_text": {
              "type": "text"
            },
            "opening_text": {
              "type": "text"
            },
            "outgoing_link": {
              "type": "text"
            },
            "popularity_score": {
              "type": "double"
            },
            "redirect": {
              "properties": {
                "namespace": {
                  "type": "long"
                },
                "title": {
                  "type": "text"
                }
              }
            },
            "score": {
              "type": "double"
            },
            "source_text": {
              "type": "text"
            },
            "template": {
              "type": "text"
            },
            "text": {
              "type": "text"
            },
            "text_bytes": {
              "type": "long"
            },
            "timestamp": {
              "type": "date",
              "format": "strict_date_optional_time||epoch_millis"
            },
            "title": {
              "type": "text"
            },
            "version": {
              "type": "long"
            },
            "version_type": {
              "type": "text"
            },
            "wiki": {
              "type": "text"
            },
            "wikibase_item": {
              "type": "text"
            }
          }
        }
      }
    }
    
  • Once the index has been created (enwiki in this example), you only need to run:

    zcat enwiki-current-cirrussearch-general.json.gz | parallel --pipe -L 2 -N 2000 -j3 'curl -s http://localhost:9200/enwiki/_bulk --data-binary @- > /dev/null'
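
The steps above could be scripted roughly as follows. This is a minimal sketch under stated assumptions, not the method from the tutorial: the function names (create_index, split_dump, load_chunks), the ES_URL and INDEX variables, the mapping.json file name and the 4000-line chunk size are all made up for illustration. It assumes an Elasticsearch 5.x cluster, since the mapping above still uses the page type. Unlike the one-liner, which sends curl's output to /dev/null, it inspects each bulk response, because a bulk request can report item-level failures in its response body.

```shell
#!/bin/sh
# Sketch only: ES_URL, INDEX and all function names are illustrative assumptions.
ES_URL="${ES_URL:-http://localhost:9200}"
INDEX="${INDEX:-enwiki}"

# Create the index from a JSON file holding the mapping (e.g. the one above).
create_index() {
  curl -sS --fail -H 'Content-Type: application/json' \
    -XPUT "$ES_URL/$INDEX" --data-binary "@$1"
}

# Split a gzipped dump ($1) into files of $3 lines each under directory $2.
# The dump alternates an action line and a document line, so the chunk size
# must be even to keep every pair inside the same chunk.
split_dump() {
  lines="${3:-4000}"
  if [ $((lines % 2)) -ne 0 ]; then
    echo "chunk size must be even" >&2
    return 1
  fi
  mkdir -p "$2"
  gzip -dc "$1" | split -l "$lines" - "$2/chunk_"
}

# Post every chunk in directory $1 to the _bulk endpoint. A chunk is reported
# when the request itself fails or when the response contains "errors":true.
load_chunks() {
  failed=0
  for file in "$1"/chunk_*; do
    if ! response=$(curl -sS --fail -H 'Content-Type: application/x-ndjson' \
        -XPOST "$ES_URL/$INDEX/_bulk" --data-binary "@$file"); then
      echo "request failed for $file" >&2
      failed=1
    elif printf '%s' "$response" | grep -q '"errors":true'; then
      echo "bulk item errors in $file" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Possible usage (file names are placeholders):
#   create_index mapping.json
#   split_dump enwiki-current-cirrussearch-general.json.gz chunks 4000
#   load_chunks chunks
```

Splitting to files first is slower than piping through parallel, but it leaves the chunks on disk, so a chunk that failed can be inspected or sent again on its own.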