Multiply after joining data in Pig

Date: 2015-08-09 01:55:28

Tags: join apache-pig

I am trying to multiply two fields and get their sum after joining three tables in Pig, but I keep getting this error:

<file loyalty_program.pig, line 30, column 74> (Name: Multiply Type: null Uid: null)incompatible types in Multiply Operator left hand side:bag :tuple(new_details1::new_details::potential_customers::num_of_orders:long) right hand side:bag :tuple(products::price:int)

-- load the data sets
orders = LOAD '/dualcore/orders' AS (order_id:int,
             cust_id:int,
             order_dtm:chararray);

details = LOAD '/dualcore/order_details' AS (order_id:int,
             prod_id:int);

products = LOAD '/dualcore/products' AS (prod_id:int,
             brand:chararray,
             name:chararray,
             price:int,
             cost:int,
             shipping_wt:int);
recent = FILTER orders by order_dtm matches '2012-.*$';

customer = GROUP recent by cust_id;

cust_orders = FOREACH customer GENERATE group as cust_id, (int)COUNT(recent) as num_of_orders;

potential_customers = FILTER cust_orders by num_of_orders>=5;

new_details = join potential_customers by cust_id, recent by cust_id;
new_details1 = join new_details by order_id, details by order_id;
new_details2 = join new_details1 by prod_id, products by prod_id;
--DESCRIBE new_details2;

final_details = FOREACH new_details2 GENERATE potential_customers::cust_id, potential_customers::num_of_orders as num_of_orders,recent::order_id as order_id,recent::order_dtm,details::prod_id,products::brand,products::name,products::price as price,products::cost,products::shipping_wt;

grouped_data = GROUP final_details by cust_id;

member = FOREACH grouped_data GENERATE SUM(final_details.num_of_orders * final_details.price)  ; 
lim = limit member 10;
dump lim; 

I even cast the result of COUNT to int, but it still throws this error. I don't know how to proceed.

1 Answer:

Answer 0: (score: 0)

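OK. As I understand it, you first want to multiply the order count by the price of each product, and then you want the sum of those products.

It is an unusual requirement, but you can handle it as follows.

After the GROUP, final_details.num_of_orders and final_details.price are bags, and the Multiply operator is not defined for bags, which is what the error message reports. All you need to do is compute the multiplication in the final_details FOREACH statement itself, where both fields are still scalars, and then simply apply SUM to the multiplied field.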

Based on your LOAD statements, I created the following input files:

main_orders.txt

6666,100,2012-01-01
7777,101,2012-09-02
8888,100,2012-01-09
9999,101,2012-12-08
6666,101,2012-09-02
9999,100,2012-07-12
9999,100,2012-08-01
6666,100,2012-01-02
7777,100,2012-09-09

orders_details.txt

6666,6000
7777,7000
8888,8000
9999,9000

main_products.txt

6000,Nike,Shoes,3000,3000,1
7000,Adidas,Cap,1000,1000,1
8000,Rebook,Shoes,4000,4000,1
9000,Puma,Shoes,25000,2500,1

Here is the code:

orders = LOAD '/user/cloudera/inputfiles/main_orders.txt'  USING PigStorage(',') AS (order_id:int,cust_id:int,order_dtm:chararray);

details = LOAD '/user/cloudera/inputfiles/orders_details.txt'  USING PigStorage(',') AS (order_id:int,prod_id:int);

products = LOAD '/user/cloudera/inputfiles/main_products.txt' USING PigStorage(',') AS(prod_id:int,brand:chararray,name:chararray,price:int,cost:int,shipping_wt:int);

recent = FILTER orders by order_dtm matches '2012-.*';

customer = GROUP recent by cust_id;

cust_orders = FOREACH customer GENERATE group as cust_id, (int)COUNT(recent) as num_of_orders;


potential_customers = FILTER cust_orders by num_of_orders>=5;

new_details = join potential_customers by cust_id, recent by cust_id;
new_details1 = join new_details by order_id, details by order_id;
new_details2 = join new_details1 by prod_id, products by prod_id;
DESCRIBE new_details2;

final_details = FOREACH new_details2 GENERATE potential_customers::cust_id, potential_customers::num_of_orders as num_of_orders,recent::order_id as order_id,recent::order_dtm,details::prod_id,products::brand,products::name,products::price as price,products::cost,products::shipping_wt, (potential_customers::num_of_orders * products::price ) as multiplied_price; -- the multiplication is computed in the last field
dump final_details;

grouped_data = GROUP final_details by cust_id;

member = FOREACH grouped_data GENERATE SUM(final_details.multiplied_price)  ; 
lim = limit member 10;
dump lim;

For clarity, I also dumped the output of the final_details FOREACH statement:

(100,6,6666,2012-01-01,6000,Nike,Shoes,3000,3000,1,18000)
(100,6,6666,2012-01-02,6000,Nike,Shoes,3000,3000,1,18000)
(100,6,7777,2012-09-09,7000,Adidas,Cap,1000,1000,1,6000)
(100,6,8888,2012-01-09,8000,Rebook,Shoes,4000,4000,1,24000)
(100,6,9999,2012-07-12,9000,Puma,Shoes,25000,2500,1,150000)
(100,6,9999,2012-08-01,9000,Puma,Shoes,25000,2500,1,150000)
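As a quick check on the rows above: each value in the last column (multiplied_price) is the order count 6 times the price, and adding that column over the six rows for customer 100 gives the total that SUM should return.

```latex
\[
2 \times 18000 + 6000 + 24000 + 2 \times 150000 = 366000
\]
```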

The final output is:

(366000)
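The final FOREACH emits only the sum, so if more than one customer qualified, the output rows could not be told apart. A minimal sketch of one way to label them, assuming the relations grouped_data and final_details from the script above; the names member_by_cust and total_price are illustrative and not part of the original code.

```pig
-- Continues from the script above: grouped_data = GROUP final_details by cust_id;
-- 'group' holds the grouping key (cust_id); emit it next to the sum.
member_by_cust = FOREACH grouped_data GENERATE
    group AS cust_id,
    SUM(final_details.multiplied_price) AS total_price;

lim_by_cust = LIMIT member_by_cust 10;
DUMP lim_by_cust;
```

With the sample files above, only customer 100 has five or more orders in 2012, so this should print a single row pairing 100 with the same total of 366000.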

This code may help you, but please clarify your requirement.