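Our query takes 20 seconds, and we need to cut that down drastically. We call it through the Python DataFrameClient, but I reproduced the same query and the same 20-second response time with the CLI client: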
influx --host 10.0.5.183 --precision RFC3339 -execute "select * from turbine_ops.permanent.turbine_interval where ((turbine_id = 'NKWF-T15' or turbine_id = 'NKWF-T41' or turbine_id = 'NKWF-T23' or turbine_id = 'NKWF-T19' or turbine_id = 'NKWF-T51' or turbine_id = 'NKWF-T14' or turbine_id = 'NKWF-T42' or turbine_id = 'NKWF-T26' or turbine_id = 'NKWF-T39' or turbine_id = 'NKWF-T49' or turbine_id = 'NKWF-T38') and time >= '2019-05-01')">/dev/null
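Influx is running on an r5.large EC2 instance with an EBS General Purpose SSD (gp2) volume, and the CLI runs on an EC2 instance in the same subnet. The query returns 747120 rows, each with 1 tag (turbine_id) and 5 fields (all decimal values). Does this seem normal?
Watching top on the Influx host, I see no noticeable change in RAM usage and a brief CPU spike lasting about 1 s at the start of the query, with no CPU activity after that.
The shard duration is set to 1 year.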
show series exact cardinality on turbine_ops
name: turbine_interval
count
-----
11
I tried scaling the InfluxDB host up to an r5.8xlarge, and the query time did not change at all.
explain select * from turbine_ops.permanent.turbine_interval where ((turbine_ = 'NKWF-T15' or turbine_id = 'NKWF-T41' or turbine_id = 'NKWF-T23' or turbine_id = 'NKWF-T19' or turbine_id = 'NKWF-T51' or turbine_id = 'NKWF-T14' or turbine_id = 'NKWF-T42' or turbine_id = 'NKWF-T26' or turbine_id = 'NKWF-T39' or turbine_id = 'NKWF-T49' or turbine_id = 'NKWF-T38') and time >= '2019-05-01')
QUERY PLAN
EXPRESSION:
AUXILIARY FIELDS: active_power::float, "duration"::integer, rotor_rpm::float, turbine_id::tag, wind_speed::float, yaw_direction::float
NUMBER OF SHARDS: 1
NUMBER OF SERIES: 10
CACHED VALUES: 0
NUMBER OF FILES: 150
NUMBER OF BLOCKS: 3515
SIZE OF BLOCKS: 12403470
explain analyze select * from turbine_ops.permanent.turbine_interval where ((turbine_ = 'NKWF-T15' or turbine_id = 'NKWF-T41' or turbine_id = 'NKWF-T23' or turbine_id = 'NKWF-T19' or turbine_id = 'NKWF-T51' or turbine_id = 'NKWF-T14' or turbine_id = 'NKWF-T42' or turbine_id = 'NKWF-T26' or turbine_id = 'NKWF-T39' or turbine_id = 'NKWF-T49' or turbine_id = 'NKWF-T38') and time >= '2019-05-01')
EXPLAIN ANALYZE
.
└── select
    ├── execution_time: 1.442047426s
    ├── planning_time: 2.105094ms
    ├── total_time: 1.44415252s
    └── build_cursor
        ├── labels
        │   └── statement: SELECT active_power::float, "duration"::integer, rotor_rpm::float, turbine_id::tag, wind_speed::float, yaw_direction::float FROM turbine_ops.permanent.turbine_interval WHERE turbine_ = 'NKWF-T15' OR turbine_id::tag = 'NKWF-T41' OR turbine_id::tag = 'NKWF-T23' OR turbine_id::tag = 'NKWF-T19' OR turbine_id::tag = 'NKWF-T51' OR turbine_id::tag = 'NKWF-T14' OR turbine_id::tag = 'NKWF-T42' OR turbine_id::tag = 'NKWF-T26' OR turbine_id::tag = 'NKWF-T39' OR turbine_id::tag = 'NKWF-T49' OR turbine_id::tag = 'NKWF-T38'
        └── iterator_scanner
            ├── labels
            │   └── auxiliary_fields: active_power::float, "duration"::integer, rotor_rpm::float, turbine_id::tag, wind_speed::float, yaw_direction::float
            └── create_iterator
                ├── labels
                │   ├── cond: turbine_ = 'NKWF-T15' OR turbine_id::tag = 'NKWF-T41' OR turbine_id::tag = 'NKWF-T23' OR turbine_id::tag = 'NKWF-T19' OR turbine_id::tag = 'NKWF-T51' OR turbine_id::tag = 'NKWF-T14' OR turbine_id::tag = 'NKWF-T42' OR turbine_id::tag = 'NKWF-T26' OR turbine_id::tag = 'NKWF-T39' OR turbine_id::tag = 'NKWF-T49' OR turbine_id::tag = 'NKWF-T38'
                │   ├── measurement: turbine_interval
                │   └── shard_id: 1584
                ├── cursors_ref: 0
                ├── cursors_aux: 50
                ├── cursors_cond: 0
                ├── float_blocks_decoded: 2812
                ├── float_blocks_size_bytes: 12382380
                ├── integer_blocks_decoded: 703
                ├── integer_blocks_size_bytes: 21090
                ├── unsigned_blocks_decoded: 0
                ├── unsigned_blocks_size_bytes: 0
                ├── string_blocks_decoded: 0
                ├── string_blocks_size_bytes: 0
                ├── boolean_blocks_decoded: 0
                ├── boolean_blocks_size_bytes: 0
                └── planning_time: 1.624627ms
Please let me know of any optimizations we could make.
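As a rough sanity check on the figures above, the short sketch below compares the total_time reported by EXPLAIN ANALYZE with the roughly 20-second wall-clock time. The variable names are illustrative only, and reading the remainder as time spent outside query execution (serialization, transfer, or client-side handling) is an assumption rather than something the output proves.

```python
# Figures copied from the question; variable names are illustrative only.
rows_returned = 747_120        # rows the query returns
wall_clock_s = 20.0            # approximate end-to-end time seen via the CLI
server_total_s = 1.44415252    # total_time from EXPLAIN ANALYZE

# Fraction of the wall-clock time that the query engine itself reports.
server_share = server_total_s / wall_clock_s
# Time left over once the reported execution time is subtracted.
unexplained_s = wall_clock_s - server_total_s
# End-to-end throughput as experienced by the client.
rows_per_s = rows_returned / wall_clock_s

print(f"engine share of wall clock: {server_share:.1%}")
print(f"time not accounted for by the engine: {unexplained_s:.1f} s")
print(f"end-to-end throughput: {rows_per_s:,.0f} rows/s")
```

If the numbers are representative, the engine accounts for only a small part of the 20 seconds, which would be consistent with the flat CPU usage seen in top.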
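Answer 0 (score: 1)
My suspicion that Influx itself was not the culprit was confirmed when I curled the HTTP API directly and got a response in about 3 seconds. I'm not sure why the CLI or the Python DataFrameClient adds so much overhead, but I got the data into a Pandas DataFrame in 3.78 s with: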
import urllib.parse
import urllib.request
from io import BytesIO

import pandas as pd

# Query parameters for the InfluxDB HTTP API /query endpoint
data = {}
data['db'] = 'turbine_ops'
data['precision'] = 's'
data['q'] = "select * from turbine_ops.permanent.turbine_interval where ((turbine_id = 'NKWF-T15' or turbine_id = 'NKWF-T41' or turbine_id = 'NKWF-T23' or turbine_id = 'NKWF-T19' or turbine_id = 'NKWF-T51' or turbine_id = 'NKWF-T14' or turbine_id = 'NKWF-T42' or turbine_id = 'NKWF-T26' or turbine_id = 'NKWF-T39' or turbine_id = 'NKWF-T49' or turbine_id = 'NKWF-T38') and time >= '2019-05-01')"
url_values = urllib.parse.urlencode(data)
url = "http://10.0.5.183:8086/query?" + url_values
# Ask for CSV instead of the default JSON response
request = urllib.request.Request(url, headers={'Accept': 'application/csv'})
response = urllib.request.urlopen(request)
response_bytestr = response.read()
# Parse the CSV payload into a DataFrame
df = pd.read_csv(BytesIO(response_bytestr), sep=",")
This is a good start, but faster would be better, so please submit other solutions.
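To see where the remaining seconds go, it may help to time the download and the CSV parsing separately. The sketch below is one possible way to do that. `build_query_url` and `time_phases` are hypothetical helper names, not part of any InfluxDB client, and the only server detail assumed is the `/query` endpoint with the `db`, `precision`, and `q` parameters already used above.

```python
import time
import urllib.parse


def build_query_url(host, db, query, precision="s", port=8086):
    """Build a /query URL like the one used above (hypothetical helper)."""
    params = urllib.parse.urlencode({"db": db, "precision": precision, "q": query})
    return f"http://{host}:{port}/query?{params}"


def time_phases(fetch, parse):
    """Call fetch(), then parse(raw); return (result, fetch_seconds, parse_seconds)."""
    t0 = time.perf_counter()
    raw = fetch()            # network phase: download the raw response body
    t1 = time.perf_counter()
    result = parse(raw)      # client phase: turn the bytes into a usable object
    t2 = time.perf_counter()
    return result, t1 - t0, t2 - t1


# Example wiring, left commented out because it needs the live server and pandas:
# import urllib.request
# import pandas as pd
# from io import BytesIO
# url = build_query_url("10.0.5.183", "turbine_ops", data["q"])
# req = urllib.request.Request(url, headers={"Accept": "application/csv"})
# df, fetch_s, parse_s = time_phases(
#     lambda: urllib.request.urlopen(req).read(),
#     lambda raw: pd.read_csv(BytesIO(raw)),
# )
```

Passing the two phases in as callables keeps the timing logic independent of the client library, so the same harness could be pointed at the DataFrameClient call to compare where each approach spends its time.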