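I need to complete the following four tasks, but I'm confused about how to join the two datasets so that any of the tasks will work...
A) Find the name of the customer with the fewest transactions, and output the customer name and the number of transactions.
B) Join customers and transactions using a broadcast (replicated) join. Report: CustomerID, Name, Salary, NumOfTransactions, TotalSum, MinItems (where NumOfTransactions is the total number of transactions done by the customer, TotalSum is the sum of the "TransTotal" field for that customer, and MinItems is the minimum number of items in any transaction done by the customer).
C) Report the country codes that have more than 5,000 or fewer than 2,000 customers.
D) Assume we want to design the analysis task as follows: the Age attribute is divided into six groups, [10,20), [20,30), [30,40), [40,50), [50,60), and [60,70]. Within each of these age ranges, further divide by Gender, so that each of the six age groups is split into two groups. For each group, report: age range, gender, MinTransTotal, MaxTransTotal, AvgTransTotal. Note: the square bracket "[" means the lower bound of the range is included, whereas ")" means the upper bound is excluded.
This is how I started: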
hadoop fs -mkdir /piginput
sudo hadoop fs -put customer.txt /piginput
sudo hadoop fs -put transaction.txt /piginput
sudo hadoop fs -put transaction_small.txt /piginput
pig
customers = LOAD '/piginput/customers.txt' USING PigStorage(',') AS (id:int,name:chararray,age:int,gender:chararray,CountryCode:int,salary:float);
transactions = LOAD '/piginput/transaction.txt' USING PigStorage(',') as (trans_id:int, id:int, age:int, total:float, num_items:int, description:chararray);
alldata = JOIN customers BY id, transactions BY id;
by_clusters_terms_count = FOREACH alldata COUNT(id);
This produces the following error:
ERROR 1031: Incompatable schema: left is "id:NULL,name:NULL,num_items:NULL", right is "customers::id:int"
Failed to parse: Pig script failed to parse:
<line 4, column 26> pig script failed to validate: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1031: Incompatable schema: left is "id:NULL,name:NULL,num_items:NULL", right is "customers::id:int"
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:196)
at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1684)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1657)
at org.apache.pig.PigServer.registerQuery(PigServer.java:600)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1069)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:228)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:203)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
at org.apache.pig.Main.run(Main.java:542)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
Caused by:
<line 4, column 26> pig script failed to validate: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1031: Incompatable schema: left is "id:NULL,name:NULL,num_items:NULL", right is "customers::id:int"
at org.apache.pig.parser.LogicalPlanBuilder.buildForeachOp(LogicalPlanBuilder.java:1041)
at org.apache.pig.parser.LogicalPlanGenerator.foreach_clause(LogicalPlanGenerator.java:15870)
at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1933)
at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:188)
... 15 more
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1031: Incompatable schema: left is "id:NULL,name:NULL,num_items:NULL", right is "customers::id:int"
at org.apache.pig.newplan.logical.relational.LogicalSchema.merge(LogicalSchema.java:760)
at org.apache.pig.newplan.logical.relational.LOGenerate.getSchema(LOGenerate.java:158)
at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:123)
at org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:245)
at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:114)
at org.apache.pig.parser.LogicalPlanBuilder.buildForeachOp(LogicalPlanBuilder.java:1039)
... 21 more
Any ideas? Am I joining the datasets incorrectly, and is that what causes the problem?
Answer 0 (score: 0)
-- Load customers and keep only the fields needed for the join.
customers = LOAD 'hdfs://hadoop-VirtualBox:8020/piginput/customer.txt' USING PigStorage(',') AS (id:int,name:chararray,age:int,gender:chararray,CountryCode:int,salary:float);
A = foreach customers generate id, name;
-- Load transactions; the customer key is named cust_id so it does not clash with customers.id.
transactions = LOAD 'hdfs://hadoop-VirtualBox:8020/piginput/transaction_small.txt' USING PigStorage(',') as (trans_id:int, cust_id:int, total:float, num_items:int, description:chararray);
B = foreach transactions generate cust_id,num_items;
-- Join on the customer ID, then group the joined rows by the first field (the customer id).
alldata = JOIN A BY id, B BY cust_id;
C = GROUP alldata by $0;
This is what finally worked and solved the problem.
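The snippet above stops at the GROUP step, so here is a rough sketch of how tasks A and B might be finished from the relations loaded above. The relation names (joined, by_cust, counts, counts_sorted, fewest, bcast, by_b, task_b) are placeholders of my own and not from the original post. I am also assuming the transaction file really has the five columns shown in the LOAD statement above. This has not been run against the actual data.

```pig
-- Task A sketch: count transactions per customer, then keep the smallest count.
-- Reuses A (id, name) and B (cust_id, num_items) from the statements above.
joined  = JOIN A BY id, B BY cust_id;
by_cust = GROUP joined BY (id, name);
counts  = FOREACH by_cust GENERATE
              FLATTEN(group) AS (id, name),
              COUNT(joined)  AS num_trans;
counts_sorted = ORDER counts BY num_trans ASC;
fewest  = LIMIT counts_sorted 1;
DUMP fewest;

-- Task B sketch: replicated (broadcast) join.
-- Pig loads the relation listed LAST into memory, so the smaller
-- relation (assumed here to be customers) goes last.
bcast  = JOIN transactions BY cust_id, customers BY id USING 'replicated';
by_b   = GROUP bcast BY (id, name, salary);
task_b = FOREACH by_b GENERATE
             FLATTEN(group)       AS (id, name, salary),
             COUNT(bcast)         AS NumOfTransactions,
             SUM(bcast.total)     AS TotalSum,
             MIN(bcast.num_items) AS MinItems;
DUMP task_b;
```

Two caveats on this sketch: the inner join drops customers who have no transactions at all, and LIMIT 1 returns only one customer even if several are tied for the fewest. Note also that every FOREACH needs the GENERATE keyword, and an aggregate such as COUNT is applied to the bag produced by a GROUP rather than directly to the joined relation.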