Indexing a CSV into Elasticsearch in Python

Posted: 2017-01-10 16:28:56

Tags: python csv elasticsearch python-3.5 elasticsearch-dsl

I want to index a CSV file into Elasticsearch without using Logstash. I am using the elasticsearch-dsl high-level library.

Given a CSV with a header row, which I currently read in chunks with pandas like this:

for chunk in pd.read_csv(file, chunksize=500000,
                         parse_dates=[['date', 'time']],  # note the extra []
                         dayfirst=True,
                         names=col_names, index_col=index_cols,
                         header=0, dtype=dtype):
    store.append('df', chunk)
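As a side note on the chunked read above: each pandas chunk can be flattened to per-row dicts with `DataFrame.to_dict('records')`, which is the shape the Elasticsearch bulk helpers expect. A minimal sketch with hypothetical inline data (the column names are placeholders, not from the question):

```python
import io
import pandas as pd

# Hypothetical sample standing in for the real CSV file.
sample = io.StringIO("name,value\nadam,1\nbeth,2\n")

records = []
for chunk in pd.read_csv(sample, chunksize=1):
    # to_dict('records') yields one plain dict per row -- a shape
    # that elasticsearch's helpers.bulk() can index directly.
    records.extend(chunk.to_dict('records'))
```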

What is the best way to index all of this data by field? Ultimately, I want each row to become its own document, with one field per column.


2 Answers:

Answer 0 (score: 21)

This sort of task is easier with the lower-level elasticsearch-py library:

from elasticsearch import helpers, Elasticsearch
import csv

es = Elasticsearch()

with open('/tmp/x.csv') as f:
    reader = csv.DictReader(f)
    # each row dict becomes one document; bulk() streams them in batches
    helpers.bulk(es, reader, index='my-index', doc_type='my-type')
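To see why this works: `csv.DictReader` yields one mapping per row, keyed by the header line, and `helpers.bulk` accepts any iterable of such dicts as document sources. A runnable sketch without an Elasticsearch server (the sample data is hypothetical, standing in for /tmp/x.csv):

```python
import csv
import io

# Inline stand-in for the CSV file (hypothetical sample data).
sample = io.StringIO("name,city\nadam,hills\nbeth,dale\n")

reader = csv.DictReader(sample)
rows = list(reader)
# Each row is a mapping keyed by the header, e.g. rows[0]['name'] is 'adam';
# helpers.bulk() would index each one as a separate document.
```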

Answer 1 (score: 1)

If you want to build a database in Elasticsearch from .tsv/.csv data with strict types and a model, for better filtering, you can do something like this:

from elasticsearch_dsl import DocType, Text

class ElementIndex(DocType):
    # one field per CSV column; ROWNAME1/ROWNAME2 are placeholders
    ROWNAME1 = Text()
    ROWNAME2 = Text()

    class Meta:
        index = 'index_name'

def indexing(row):
    obj = ElementIndex(
        ROWNAME1=str(row['NAME1']),
        ROWNAME2=str(row['NAME2'])
    )
    obj.save(index="index_name")
    return obj.to_dict(include_meta=True)

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

def bulk_indexing(result):

    # ElementIndex.init(index="index_name")
    ElementIndex.init()  # create the index and mapping if they don't exist
    es = Elasticsearch()

    # here 'result' is your iterable of row dicts loaded from the source file

    bulk(client=es, actions=(indexing(c) for c in result))
    es.indices.refresh()
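The `result` iterable fed to `bulk` above is assumed to come from the source file; one way to build it is with `csv.DictReader`. In this sketch the file contents and the NAME column are placeholders matching the `row['NAME']`-style lookups in `indexing()`:

```python
import csv
import os
import tempfile

def load_rows(path):
    """Read a CSV into a list of dicts, one per row, keyed by the header."""
    with open(path, newline='') as f:
        return list(csv.DictReader(f))

# Throwaway demo file; real code would pass the actual CSV path.
tmp = tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False)
tmp.write("NAME\nfoo\nbar\n")
tmp.close()
result = load_rows(tmp.name)
os.unlink(tmp.name)
```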