我想加载XML Wikipedia转储,例如: http://ftp.acc.umu.se/mirror/wikimedia.org/dumps/enwiki/20171001/enwiki-20171001-pages-articles.xml.bz2 进入Elasticsearch(5.6.4)。 但是,我发现的所有工具和教程都已过时,与我的Elasticsearch版本不兼容。 任何人都可以解释将转储导入Elasticsearch的最佳方法是什么?
答案 0 :(得分:3)
两年前,维基媒体已经提供了生产弹性研究指数的转储。
每周都会导出索引,每个维基都有两个导出。
The content index, which contains only article pages, called content;
The general index, containing all pages. This includes talk pages, templates, etc, called general;
你可以在http://dumps.wikimedia.org/other/cirrussearch/current/
找到它们根据您的需要创建映射。例如:
{
"mappings": {
"page": {
"properties": {
"auxiliary_text": {
"type": "text"
},
"category": {
"type": "text"
},
"coordinates": {
"properties": {
"coord": {
"properties": {
"lat": {
"type": "double"
},
"lon": {
"type": "double"
}
}
},
"country": {
"type": "text"
},
"dim": {
"type": "long"
},
"globe": {
"type": "text"
},
"name": {
"type": "text"
},
"primary": {
"type": "boolean"
},
"region": {
"type": "text"
},
"type": {
"type": "text"
}
}
},
"defaultsort": {
"type": "boolean"
},
"external_link": {
"type": "text"
},
"heading": {
"type": "text"
},
"incoming_links": {
"type": "long"
},
"language": {
"type": "text"
},
"namespace": {
"type": "long"
},
"namespace_text": {
"type": "text"
},
"opening_text": {
"type": "text"
},
"outgoing_link": {
"type": "text"
},
"popularity_score": {
"type": "double"
},
"redirect": {
"properties": {
"namespace": {
"type": "long"
},
"title": {
"type": "text"
}
}
},
"score": {
"type": "double"
},
"source_text": {
"type": "text"
},
"template": {
"type": "text"
},
"text": {
"type": "text"
},
"text_bytes": {
"type": "long"
},
"timestamp": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"title": {
"type": "text"
},
"version": {
"type": "long"
},
"version_type": {
"type": "text"
},
"wiki": {
"type": "text"
},
"wikibase_item": {
"type": "text"
}
}
}
}
}
创建索引后,只需输入:
zcat enwiki-current-cirrussearch-general.json.gz | parallel --pipe -L 2 -N 2000 -j3 'curl -s http://localhost:9200/enwiki/_bulk --data-binary @- > /dev/null'
享受!
答案 1 :(得分:0)
我尝试了多种导入维基百科的方法。我发现有两种方法可以使用Logstash并直接编写python编码器。