我正在对露天飞行数据的开源历史数据集(7TB)进行研究分析。文档[1]:https://opensky-network.org/data/impala。
表结构具有以下结构:
[hadoop-2:21000] > describe state_vectors_data4;
Query: describe state_vectors_data4
+---------------+------------+-----------------------------+
| name | type | comment |
+---------------+------------+-----------------------------+
| time | int | Inferred from Parquet file. |
| icao24 | string | Inferred from Parquet file. |
| lat | double | Inferred from Parquet file. |
| lon | double | Inferred from Parquet file. |
| velocity | double | Inferred from Parquet file. |
| heading | double | Inferred from Parquet file. |
| vertrate | double | Inferred from Parquet file. |
| callsign | string | Inferred from Parquet file. |
| onground | boolean | Inferred from Parquet file. |
| alert | boolean | Inferred from Parquet file. |
| spi | boolean | Inferred from Parquet file. |
| squawk | string | Inferred from Parquet file. |
| baroaltitude | double | Inferred from Parquet file. |
| geoaltitude | double | Inferred from Parquet file. |
| lastposupdate | double | Inferred from Parquet file. |
| lastcontact | double | Inferred from Parquet file. |
| serials | array<int> | Inferred from Parquet file. |
| hour | int | |
+---------------+------------+-----------------------------+
一个示例如下:
[hadoop-2:21000] > select * from state_vectors_data4 where hour=1480759200 order by rand() limit 1;
Query: select * from state_vectors_data4 where hour=1480759200 order by rand() limit 1
+------------+--------+-------------------+--------------------+-------------------+-------------------+----------+----------+----------+-------+-------+--------+-------------------+-------------------+----------------+----------------+------------+
| time | icao24 | lat | lon | velocity | heading | vertrate | callsign | onground | alert | spi | squawk | baroaltitude | geoaltitude | lastposupdate | lastcontact | hour |
+------------+--------+-------------------+--------------------+-------------------+-------------------+----------+----------+----------+-------+-------+--------+-------------------+-------------------+----------------+----------------+------------+
| 1480760792 | a0d724 | 37.89463883739407 | -88.93331113068955 | 190.8504039695975 | 265.8263544365708 | 0 | UPS858 | false | false | false | 7775 | 9144 | 9342.120000000001 | 1480760704.359 | 1480760709.997 | 1480759200 |
+------------+--------+-------------------+--------------------+-------------------+-------------------+----------+----------+----------+-------+-------+--------+-------------------+-------------------+----------------+----------------+------------+
Fetched 1 row(s) in 1.48s
添加了小时列,以加快查询速度。
我使用以下查询来基于时间(时间/小时)和位置(纬度/经度)进行过滤,同时以180((time / 180 AS INT))的速率进行重采样:
SELECT sv.time,
sv.icao24,
sv.callsign,
sv.hour
FROM state_vectors_data4 sv
JOIN
(
SELECT CAST (time / 180 AS INT) AS minute,
MAX(time) AS recent,
icao24
FROM state_vectors_data4
WHERE hour BETWEEN 1483228800 AND 1485907200 AND
time BETWEEN 1483228800 AND 1485907200
GROUP BY icao24,
minute
)
AS m ON sv.icao24 = m.icao24 AND
sv.time = m.recent
WHERE lat BETWEEN 50.1 AND 54.1 AND
lon BETWEEN 2.2 AND 8.2 AND
hour BETWEEN 1483228800 AND 1485907200 AND
time BETWEEN 1483228800 AND 1485907200;
但是,查询非常慢,有没有更快的方法来获得相同的结果?
编辑: 添加了EXPLAIN字符串:
+---------------------------------------------------------------------------------------------------------------------------+
| Explain String |
+---------------------------------------------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=104.00MB |
| Per-Host Resource Estimates: Memory=3.37GB |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| opensky.state_vectors_data4 |
| |
| PLAN-ROOT SINK |
| | |
| 07:EXCHANGE [UNPARTITIONED] |
| | |
| 03:HASH JOIN [INNER JOIN, BROADCAST] |
| | hash predicates: sv.time = max(time), sv.icao24 = icao24 |
| | runtime filters: RF000 <- max(time), RF001 <- icao24 |
| | |
| |--06:EXCHANGE [BROADCAST] |
| | | |
| | 05:AGGREGATE [FINALIZE] |
| | | output: max:merge(time) |
| | | group by: icao24, CAST(time / 180 AS INT) |
| | | having: max(time) <= 1485907200, max(time) >= 1483228800 |
| | | |
| | 04:EXCHANGE [HASH(icao24,CAST(time / 180 AS INT))] |
| | | |
| | 02:AGGREGATE [STREAMING] |
| | | output: max(time) |
| | | group by: icao24, CAST(time / 180 AS INT) |
| | | |
| | 01:SCAN HDFS [opensky.state_vectors_data4] |
| | partitions=745/28054 files=745 size=109.64GB |
| | predicates: time <= 1485907200, time >= 1483228800 |
| | |
| 00:SCAN HDFS [opensky.state_vectors_data4 sv] |
| partitions=745/28054 files=745 size=109.64GB |
| predicates: sv.lat <= 54.1, sv.lat >= 50.1, sv.lon <= 8.2, sv.lon >= 2.2, sv.time <= 1485907200, sv.time >= 1483228800 |
| runtime filters: RF000 -> sv.time, RF001 -> sv.icao24 |
+---------------------------------------------------------------------------------------------------------------------------+
[hadoop-29:21000] >