In my case, I have to scroll through 1 to 2 million JSON results from Elastic. However, building a pandas DataFrame from the results is quite slow (roughly 10 seconds per 100,000 records). My code is listed below:
import pandas as pd
from elasticsearch import Elasticsearch

es = Elasticsearch()
appended_data = []
# Initial search request ('my-index' and the match_all query are placeholders)
page = es.search(index='my-index', body={'query': {'match_all': {}}}, scroll='1m', size=10000)
sid = page['_scroll_id']
scroll_size = len(page['hits']['hits'])
while scroll_size > 0:
    # Build a DataFrame from the current page of hits
    frame = pd.DataFrame([document['_source'] for document in page['hits']['hits']])
    frame['L7P'] = frame['L7P'].astype('category')
    appended_data.append(frame)
    page = es.scroll(scroll_id=sid, scroll='1m', request_timeout=30)
    # Update the scroll ID
    sid = page['_scroll_id']
    # Get the number of results returned by the last scroll
    scroll_size = len(page['hits']['hits'])
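
One common way to speed this up is to avoid creating a DataFrame for every page: collect the raw _source dicts first, then build a single DataFrame and do the categorical conversion once at the end. Below is a minimal sketch using the elasticsearch.helpers.scan helper, which wraps the scroll API and handles the scroll_id bookkeeping; the index name and query here are placeholder assumptions, not taken from the original code:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
import pandas as pd

es = Elasticsearch()

# scan() drives the scroll API internally and yields one hit at a time
hits = scan(
    es,
    index='my-index',                       # placeholder index name
    query={'query': {'match_all': {}}},     # placeholder query
    scroll='1m',
    size=10000,
)

# Build the DataFrame once from all hits, then convert the column
# to categorical a single time instead of once per page
df = pd.DataFrame([hit['_source'] for hit in hits])
df['L7P'] = df['L7P'].astype('category')

This removes the per-page DataFrame construction and the repeated astype('category') calls from the loop; whether it is fast enough for 1 to 2 million documents will still depend on document size and network round trips.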