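I'm using the Python library Impyla to query data in HDFS with Impala from a Python script. The data is proxy data, and there is a lot of it. I have a script that runs once a day to pull the previous day's data and run statistics on it. At the moment the query filters on the devicereceipttime field, which is stored as a timestamp.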
from impala.dbapi import connect
from impala.util import as_pandas
import pandas as pd
# Pull the desired features from the proxy_realtime_p table.
# (cursor comes from an Impyla connection; the setup is omitted here.)
cursor.execute('''
    select request, count(*) as count
    from default.proxy_realtime_p
    where devicereceipttime BETWEEN concat(to_date(now() - interval 1 days), " 00:00:00")
                                and concat(to_date(now() - interval 1 days), " 23:59:59")
    group by request
    order by count desc
''')
This query takes a while to run, and I'd like to speed it up if possible. Given the fields below, is my query the most efficient it can be?
devicereceipttime (timestamp)
year (int)
month (int)
day (int)
hour (int)
minute (int)
seconds (int)
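
For comparison, here is a rough sketch of the alternative I'm considering: filtering on the integer year, month and day fields instead of devicereceipttime. It assumes those fields are partition columns (or at least cheaper to filter on), which I haven't confirmed. The helper name build_yesterday_query is made up for illustration.

```python
from datetime import date, timedelta


def build_yesterday_query(today=None):
    """Build the per-request count query, filtered on year/month/day.

    Assumption: year, month and day are integer columns of
    default.proxy_realtime_p, as in the field list above.
    """
    today = today or date.today()
    yesterday = today - timedelta(days=1)
    # The values are ints taken from a date object, so inlining them is safe.
    return (
        "select request, count(*) as count "
        "from default.proxy_realtime_p "
        f"where year = {yesterday.year} "
        f"and month = {yesterday.month} "
        f"and day = {yesterday.day} "
        "group by request "
        "order by count desc"
    )


# Usage with an existing Impyla cursor (connection setup not shown):
# cursor.execute(build_yesterday_query())
```

Computing the date in Python rather than in SQL keeps the filter as plain integer comparisons, but whether that is actually faster depends on how the table is laid out.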