In my case, I have to scroll through 1 to 2 million JSON results from Elastic. However, building a pandas DataFrame from the results is quite slow (roughly 10 seconds per 100,000 records). My code is listed below:
import pandas as pd
from elasticsearch import Elasticsearch

es = Elasticsearch()
appended_data = []
# Initial search request ('my-index' and the match_all query are placeholders)
page = es.search(index='my-index', body={'query': {'match_all': {}}}, scroll='1m', size=10000)
sid = page['_scroll_id']
scroll_size = len(page['hits']['hits'])
while scroll_size > 0:
    # Build a DataFrame from the current page of hits
    frame = pd.DataFrame([document['_source'] for document in page['hits']['hits']])
    frame['L7P'] = frame['L7P'].astype('category')
    appended_data.append(frame)
    page = es.scroll(scroll_id=sid, scroll='1m', request_timeout=30)
    # Update the scroll ID
    sid = page['_scroll_id']
    # Get the number of results returned by the last scroll
    scroll_size = len(page['hits']['hits'])
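
One common way to speed this up is to avoid creating a DataFrame for every page: collect the raw _source dicts first, then build a single DataFrame and do the categorical conversion once at the end. Below is a minimal sketch using the elasticsearch.helpers.scan helper, which wraps the scroll API and handles the scroll_id bookkeeping; the index name and query here are placeholder assumptions, not taken from the original code:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
import pandas as pd

es = Elasticsearch()

# scan() drives the scroll API internally and yields one hit at a time
hits = scan(
    es,
    index='my-index',                       # placeholder index name
    query={'query': {'match_all': {}}},     # placeholder query
    scroll='1m',
    size=10000,
)

# Build the DataFrame once from all hits, then convert the column
# to categorical a single time instead of once per page
df = pd.DataFrame([hit['_source'] for hit in hits])
df['L7P'] = df['L7P'].astype('category')

This removes the per-page DataFrame construction and the repeated astype('category') calls from the loop; whether it is fast enough for 1 to 2 million documents will still depend on document size and network round trips.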