Presto on Cassandra performs slowly

Time: 2017-06-22 15:59:36

Tags: performance presto cassandra-3.0

I am using Presto to query Cassandra records, and it takes about 8 minutes to return the result. I need to improve the response time.

The Presto configuration is below:

   coordinator=true
   node-scheduler.include-coordinator=false
   http-server.http.port=8080
   query.max-memory=5GB
   query.max-memory-per-node=3GB
   discovery-server.enabled=true
   discovery.uri=http://URL:8080
   task.max-worker-threads=10
   task.concurrency=32 

   Workers: 4

   coordinator=false
   http-server.http.port=8080
   query.max-memory=5GB
   query.max-memory-per-node=2GB
   discovery.uri=http://URL:8080
   task.max-worker-threads=16
   task.concurrency=32

   Cassandra: 4 nodes

Fragment 2              Cost: CPU 1.98m, Input: 17833912 rows (1.49GB), Output: 13089502 rows (1.31GB)
     ScanFilterProject[table = cassandra:cassandra:rasapp:raslog, originalConstraint = (("bucketid" = CAST('2017062113'
             Cost: 96.12%, Input: 23169736 rows (22.10MB), Output: 17833912 rows (1.49GB), Filtered: 23.03%

How can I improve Presto's response time while still querying by a partition key that holds about 23 million records?

CREATE TABLE TEST.TEST_LOG (
  bucketId              varchar,
  id                    timeuuid,
  transaction_id        varchar,
  ras_transaction_id    varchar,
  msg_seq_id            int,
  host_name             varchar,
  matip_channel_id      varchar,
  hth_id                varchar,
  mq_id                 varchar,
  log_point             varchar,
  entry_time            timestamp,
  exit_time             timestamp,
  source_carrier        varchar,
  destination_carrier   varchar,
  source_dcs            varchar,
  destination_dcs       varchar,
  message_type          varchar,
  message_direction     int,
  error_code_business   varchar,
  exception_code        varchar,
  exception_description varchar,
  scenario              varchar,
  created_date          timestamp,
  huborcar              varchar,
  noof_fanout           varchar,
  flight_date           timestamp,
  route_origin          varchar,
  route_destination     varchar,
  class_service         varchar,
  no_of_seats           varchar,
  ras_host              varchar,
  cp_host               varchar,
  PRIMARY KEY(bucketid, created_date, msg_seq_id,message_direction,scenario,source_dcs,exception_code,log_point,transaction_id,id)
) WITH default_time_to_live = 2851200 and CLUSTERING ORDER BY (created_date ASC, msg_seq_id ASC,message_direction ASC,scenario ASC,source_dcs ASC,exception_code ASC,log_point ASC,transaction_id ASC,id ASC);

Query

select
transaction_id,
message_direction,
message_type,
max(exception_code) as exception_code,
min(entry_time) as min_entry,
max(entry_time) as max_entry,
min(exit_time) as min_exit,
max(exit_time) as max_exit
from TEST.TEST_LOG
where bucketid='2017062113'
and (
((msg_seq_id<=2 and message_type='PAOREQ'  ) or
( msg_seq_id>2 and message_type='PAORES'  )))
group by transaction_id,
message_direction,
message_type

Time taken: 8 minutes

Thanks,

1 answer:

Answer 0 (score: 0)

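Two things. First, the 0.180 release of Presto will include pushdown of inequality predicates on clustering keys, which should help your query. Second, your schema is not well suited to the query you are running. In Cassandra it is best to a) query a specific partition (which you do), and b) put predicates on the clustering keys in the order in which they are declared (because that is the sort order Cassandra uses on disk). If your primary key were (bucketid, message_type, msg_seq_id, ...), you would probably see better performance.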
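As a rough sketch of that schema change, a second table could cluster rows by message_type and then msg_seq_id, so that both branches of the WHERE clause become contiguous range reads within the partition. The table name test_log_by_type, the reduced column list, and the clustering columns after msg_seq_id are assumptions made for illustration; the answer only specifies the first three key columns.

```sql
-- Hypothetical table: only the columns the query touches are shown.
-- Clustering columns after msg_seq_id are assumed; id keeps each row unique.
CREATE TABLE test.test_log_by_type (
  bucketid          varchar,
  message_type      varchar,
  msg_seq_id        int,
  transaction_id    varchar,
  message_direction int,
  id                timeuuid,
  exception_code    varchar,
  entry_time        timestamp,
  exit_time         timestamp,
  PRIMARY KEY (bucketid, message_type, msg_seq_id, transaction_id, message_direction, id)
) WITH CLUSTERING ORDER BY (message_type ASC, msg_seq_id ASC, transaction_id ASC,
                            message_direction ASC, id ASC);

-- Each branch of the original OR is an equality on the first clustering
-- column followed by a range on the second, which Cassandra can serve
-- as a single slice of the partition.
SELECT transaction_id, message_direction, message_type,
       exception_code, entry_time, exit_time
FROM test.test_log_by_type
WHERE bucketid = '2017062113' AND message_type = 'PAOREQ' AND msg_seq_id <= 2;

SELECT transaction_id, message_direction, message_type,
       exception_code, entry_time, exit_time
FROM test.test_log_by_type
WHERE bucketid = '2017062113' AND message_type = 'PAORES' AND msg_seq_id > 2;
```

The grouping and min/max aggregation would still have to run in Presto or in the client, but over far fewer rows than a scan of the whole partition.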

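Also, Presto does not push aggregations down to Cassandra (or to any connector), so if you are aggregating a lot of data and do not need Presto for a federated query, it may be faster to run the query in Cassandra directly.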