I am reading data from an Elasticsearch cluster and converting it into a pandas DataFrame. I want to run some analysis on the DataFrame and then visualize it, ideally in real time. But my Elasticsearch cluster responds very slowly, and most of the time I get the following error:
elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='localhost', port=9200): Read timed out. (read timeout=10))
The code I use for the above is:
import pandas as pd
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
# Define the client, pointing at our Elasticsearch cluster URL.
client = Elasticsearch(['http://localhost:9200/'])
# Define a search on the client using the Search class.
# Note that Search starts with a capital S (Search(using=client)), as it is a class.
search = Search(using=client)
# Execute the search and store the response in results so we can see how many hits we get.
results = search.execute()
# To see the total number of hits, run the line below.
results.hits.total  # 2351472834 (I got 2.3 billion hits)
# Define a fresh Search object s to run the query on.
# You have to recreate this object every time before running a new query.
s = Search(using=client)
# Add any filters/queries....
# Uncomment the line below to dump all the data; in this case that is 2.3 billion documents.
# s = s.query({"match_all": {}})
# In the code below you can add filters, queries, or time constraints.
s = s.query({"constant_score": {
"filter": {
"bool": {
"must": [{
"range": {"@timestamp": {
"gte": "2019-05-15T14:00:00.000Z", # gte - greater than
"lte": "2019-05-15T14:30:00.000Z" # lte - less than
}}
}],
"filter": [
# 1st filter, get all the data where type is "vx_apache_json"
{"term": {"type": "vx_pp_log"}},
# 2nd filter, get all the data where domain is "fnwp"
{"term": {"domain": "fnwp"}},
# 3rd filter, get all the data where RTP:a is "end"
{"term": {"RTP:a": "end"}},
]
}}}})
# After building the query in s, apply the scan method to stream every matching
# document and convert the results into a DataFrame.
results_df = pd.DataFrame(d.to_dict() for d in s.scan())
# To have a look at the DataFrame, uncomment the line below.
# results_df
results_df.to_csv('signin.csv', index=False)
I am only reading 30 minutes of data here, whereas I want to read 24 hours, or 4 hours, depending on what I need in my filter:
"gte": "2019-05-15T14:00:00.000Z",  # gte - greater than or equal to
"lte": "2019-05-15T14:30:00.000Z"   # lte - less than or equal to
Answer 0 (score: 1)
Since it is hard to optimize a search query without access to the Elasticsearch cluster itself, I can only tell you how to deal with the ReadTimeout error, by increasing the timeout:
client = Elasticsearch(['http://localhost:9200/'], timeout=60, max_retries=10, retry_on_timeout=True)
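As a follow-up (not part of the original answer): if you would rather not raise the timeout for every call the client makes, you can also set it per request via the Search object's params() method; request_timeout is the standard per-request transport option in elasticsearch-py. A minimal sketch, reusing the s object from the question:

# Apply the timeout only to this search/scan instead of the whole client.
s = s.params(request_timeout=60)
results_df = pd.DataFrame(d.to_dict() for d in s.scan())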