Kindest,
我正在尝试扩展一些有效的HIVE查询,但似乎不尽如人意。只是想测试GROUP BY功能,这是我需要完成的许多查询所共有的。这是我试图执行的查询:
DROP table CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary;
CREATE EXTERNAL TABLE IF NOT EXISTS CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary ( messageRowID STRING, payload_sensor INT, messagetimestamp BIGINT, payload_temp FLOAT, payload_timestamp BIGINT, payload_timestampmysql STRING, payload_watt INT, payload_wattseconds INT )
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES ( "cassandra.host" = "127.0.0.1",
"cassandra.port" = "9160",
"cassandra.ks.name" = "EVENT_KS",
"cassandra.ks.username" = "admin",
"cassandra.ks.password" = "admin",
"cassandra.cf.name" = "currentcost_stream",
"cassandra.columns.mapping" = ":key, payload_sensor, Timestamp, payload_temp, payload_timestamp, payload_timestampmysql, payload_watt, payload_wattseconds" );
select messageRowID, payload_sensor, messagetimestamp, payload_temp, payload_timestamp, payload_timestampmysql, payload_watt, payload_wattseconds, hour(from_unixtime(payload_timestamp)) AS hourly
FROM CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary
WHERE payload_timestamp > unix_timestamp() - 3024*60*60
GROUP BY hourly;
这会产生以下错误:
错误:执行Hive脚本时出错。查询返回非零代码: 10,原因:FAILED:语义分析错误:第1行:320无效 表别名或列引用'每小时' :(可能的列名是: messagerowid,payload_sensor,messagetimestamp,payload_temp, payload_timestamp,payload_timestampmysql,payload_watt, payload_wattseconds)
目的是最终得到一个时间限制的查询(比如说最后24小时),在REST()上填写payload_wattsecond等。为了开始打破然后创建摘要表,我开始按查询构建一个组将导出select查询的每小时锚点。
问题虽然是上面的错误。非常感谢任何有关此处错误的指示..似乎无法自己找到它,但我再次成为HIVE的新手。
提前致谢..
更新:尝试更新查询。这是我刚刚尝试运行的查询:
DROP table CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary;
CREATE EXTERNAL TABLE IF NOT EXISTS CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary ( messageRowID STRING, payload_sensor INT, messagetimestamp BIGINT, payload_temp FLOAT, payload_timestamp BIGINT, payload_timestampmysql STRING, payload_watt INT, payload_wattseconds INT )
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES ( "cassandra.host" = "127.0.0.1",
"cassandra.port" = "9160",
"cassandra.ks.name" = "EVENT_KS",
"cassandra.ks.username" = "admin",
"cassandra.ks.password" = "admin",
"cassandra.cf.name" = "currentcost_stream",
"cassandra.columns.mapping" = ":key, payload_sensor, Timestamp, payload_temp, payload_timestamp, payload_timestampmysql, payload_watt, payload_wattseconds" );
select messageRowID, payload_sensor, messagetimestamp, payload_temp, payload_timestamp, payload_timestampmysql, payload_watt, payload_wattseconds, hour(from_unixtime(payload_timestamp))
FROM CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary
WHERE payload_timestamp > unix_timestamp() - 3024*60*60
GROUP BY hour(from_unixtime(payload_timestamp));
..但是会产生另一个错误,即:
ERROR: Error while executing Hive script.Query returned non-zero code: 10, cause: FAILED: Error in semantic analysis: Line 1:7 Expression not in GROUP BY key 'messageRowID'
思想?
更新#2)以下是一些快速转储,这些样本派生到WSO2BAM中的EVENT_KS CF中。最后一列是计算的(在perl守护进程中......)#watt_seconds,它将在查询中用于计算总计为kwH的总和,然后将其转储到MySQL表中以同步到保存ui的应用程序/ ux layer ..
[12:03:00] [jskogsta@enterprise ../Product Centric Opco Modelling]$ ~/local/apache-cassandra-2.0.8/bin/cqlsh localhost 9160 -u admin -p admin --cqlversion="3.0.5"
Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.1 | Cassandra 1.2.13 | CQL spec 3.0.5 | Thrift protocol 19.36.2]
Use HELP for help.
cqlsh> use "EVENT_KS";
cqlsh:EVENT_KS> select * from currentcost_stream limit 5;
key | Description | Name | Nick_Name | StreamId | Timestamp | Version | payload_sensor | payload_temp | payload_timestamp | payload_timestampmysql | payload_watt | payload_wattseconds
-------------------------------------------+---------------------------+--------------------+----------------------+---------------------------+---------------+---------+----------------+--------------+-------------------+------------------------+--------------+---------------------
1403365575174::10.11.205.218::9443::9919 | Sample data from CC meter | currentcost.stream | Currentcost Realtime | currentcost.stream:1.0.18 | 1403365575174 | 1.0.18 | 1 | 13.16 | 1403365575 | 2014-06-21 23:46:15 | 6631 | 19893
1403354553932::10.11.205.218::9443::2663 | Sample data from CC meter | currentcost.stream | Currentcost Realtime | currentcost.stream:1.0.18 | 1403354553932 | 1.0.18 | 1 | 14.1 | 1403354553 | 2014-06-21 20:42:33 | 28475 | 0
1403374113341::10.11.205.218::9443::11852 | Sample data from CC meter | currentcost.stream | Currentcost Realtime | currentcost.stream:1.0.18 | 1403374113341 | 1.0.18 | 1 | 10.18 | 1403374113 | 2014-06-22 02:08:33 | 17188 | 154692
1403354501924::10.11.205.218::9443::1894 | Sample data from CC meter | currentcost.stream | Currentcost Realtime | currentcost.stream:1.0.18 | 1403354501924 | 1.0.18 | 1 | 10.17 | 1403354501 | 2014-06-21 20:41:41 | 26266 | 0
1403407054092::10.11.205.218::9443::15527 | Sample data from CC meter | currentcost.stream | Currentcost Realtime | currentcost.stream:1.0.18 | 1403407054092 | 1.0.18 | 1 | 17.16 | 1403407054 | 2014-06-22 11:17:34 | 6332 | 6332
(5 rows)
cqlsh:EVENT_KS>
我将尝试做的是针对此表发出查询(实际上是根据所需的各种表示聚合倍数...),并提供基于每小时总和的视图,10分钟总和,每日金额,每月金额等等。根据查询,GROUP BY旨在给出这个'索引'可以这么说。现在只是测试这个...所以将看到它最终如何结束。希望这有意义吗?!
所以不要试图删除重复...
更新3)这一切都是错误的......并且在下面给出的提示上再多想一想。因此,只是简化整个查询给出了正确的结果。以下查询为WHOLE数据集每小时提供kwH的总量。有了这个,我可以创建各种时间段(如
)花费的各种kwH迭代次数..等等。
以下是查询:
DROP table CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary;
CREATE EXTERNAL TABLE IF NOT EXISTS CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary ( messageRowID STRING, payload_sensor INT, messagetimestamp BIGINT, payload_temp FLOAT, payload_timestamp BIGINT, payload_timestampmysql STRING, payload_watt INT, payload_wattseconds INT )
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES ( "cassandra.host" = "127.0.0.1",
"cassandra.port" = "9160",
"cassandra.ks.name" = "EVENT_KS",
"cassandra.ks.username" = "admin",
"cassandra.ks.password" = "admin",
"cassandra.cf.name" = "currentcost_stream",
"cassandra.columns.mapping" = ":key, payload_sensor, Timestamp, payload_temp, payload_timestamp, payload_timestampmysql, payload_watt, payload_wattseconds" );
select hour(from_unixtime(payload_timestamp)) AS hourly, (sum(payload_wattseconds)/(60*60)/1000)
FROM CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary
GROUP BY hour(from_unixtime(payload_timestamp));
此查询根据示例数据生成以下内容:
hourly _c1
0 16.91570472222222
1 16.363228888888887
2 15.446414166666667
3 11.151388055555556
4 18.10564666666667
5 2.2734924999999997
6 17.370668055555555
7 17.991484444444446
8 38.632728888888884
9 16.001440555555554
10 15.887023888888889
11 12.709341944444445
12 23.052629722222225
13 14.986092222222222
14 16.182284722222224
15 5.881564999999999
18 2.8149172222222223
19 17.484405
20 15.888274166666665
21 15.387210833333333
22 16.088641666666668
23 16.49990916666667
这是整个数据集的每小时时间框架的总kwH ..
所以,现在谈到下一个问题。 ;-)