Difficulty with Elasticsearch bulk import?

Time: 2018-06-24 14:58:45

Tags: elasticsearch scrapy

I'm trying to output some data in the Elasticsearch bulk import format. This requires two JSON Lines (JL) per document, like this:

{"index": {"_type": "media", "_id": "https://macaulaylibrary.org/asset/75247", "_index": "audiomnia_dev"}}
{"description": "Macaulay Library ML75247; aracari sp.; Pteroglossus sp.; \u00a9\u00a0Curtis Marantz; Lago Sachavacaya Trail, right bank Rio Tambopata, Madre de Dios, Peru; 23 Aug 1994", "creator": "Curtis Marantz", "url": "https://macaulaylibrary.org/asset/75247", "image": "https://macaulaylibrary.org/media/Spectrograms/audio/image/320/0/75/75247.jpg", "commonName": "aracari sp.", "fileFormat": "audio", "sciName": "Pteroglossus sp.", "dateCreated": "1994-08-23T08:13:00", "geo": {"lat": "-12.9", "lon": "-69.3667"}, "contentLocation": "Lago Sachavacaya Trail, right bank Rio Tambopata, Madre de Dios, Peru", "name": "ML75247 aracari sp. Macaulay Library"}

Is there a reliable way to do this in Scrapy? I have the following, but there is a race condition that in some cases scrambles the order of the lines, which causes the Elasticsearch bulk API to choke:

yield { "index" : {
    "_index" : "audiomnia_dev",
    "_type" : "media",
    "_id" : json_ld["url"] }
}
yield json_ld

What is the correct way to ensure the two JL lines stay together while still following the generator/yield pattern?

1 answer:

Answer 0 (score: 2)

Have the spider yield a single item containing all the relevant data, and write a custom item exporter that formats it for Elasticsearch.
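
For example, a minimal sketch of such an exporter (not from the original answer; the module path, the "esbulk" format name, and the reuse of the index/type names and the url field from the question are assumptions):

import json

from scrapy.exporters import JsonLinesItemExporter


class EsBulkItemExporter(JsonLinesItemExporter):
    """Write each item as an Elasticsearch bulk action line followed by its document line."""

    def export_item(self, item):
        # Serialize the item once, then emit both lines in a single write so
        # they can never be separated or reordered by concurrent yields.
        doc = dict(self._get_serialized_fields(item))
        action = {
            "index": {
                "_index": "audiomnia_dev",  # index/type/id taken from the question
                "_type": "media",
                "_id": doc["url"],
            }
        }
        line = json.dumps(action) + "\n" + json.dumps(doc) + "\n"
        self.file.write(line.encode("utf-8"))

Then register it in the project settings (the project/module name here is hypothetical):

# settings.py
FEED_EXPORTERS = {
    "esbulk": "myproject.exporters.EsBulkItemExporter",
}

With that registered, running the crawl with the feed format set to esbulk (for example, scrapy crawl myspider -o items.jsonl -t esbulk on Scrapy 1.x) should produce a file that can be POSTed directly to the _bulk endpoint, since each item is written as an action/document pair in one call.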