Analyzing sales data with a PySpark job

Date: 2018-01-28 16:25:19

Tags: apache-spark pyspark

I am trying to write a Spark job in Python. I have two csv files with the following information:

File 1) product_prices.csv

product1    10
product2    20
product3    30

File 2) Sales_information.csv

id  buyer   transaction_date    seller  sales_data
1   buyer1  2015-1-01   seller1 {"product1":12,"product2":44}
2   buyer2  2015-1-01   seller3 {"product2":12}
3   buyer1  2015-1-01   seller3 {"product3":60,"product1":42}
4   buyer3  2015-1-01   seller2 {"product2":9,"product3":2}
5   buyer3  2015-1-01   seller1 {"product2":8}

Now, from these two data files, I want a Spark job that computes two things and writes the results to csv files:

1) The total sales amount for each seller, written to a total_sellers_sales.csv file, for example:

`seller_id  total_sales`
`seller1        1160`

2) The list of buyers for each seller, written to sellers_buyers_list.csv, like this:

seller_id   buyers
seller1     buyer1, buyer3

So could anyone tell me the correct way to write this Spark job?

Note: I need the code in Python.

1 Answer:

Answer 0 (score: 0)

Here is my PySpark code, written in Zeppelin 0.7.2. First, I created the sample dataframes manually:

from pyspark.sql.functions import *
from pyspark.sql import functions as F

# Product price table: (product, price)
products = [("product1", 10), ("product2", 20), ("product3", 30)]
dfProducts = sqlContext.createDataFrame(products, ['product', 'price'])

# Sales flattened to one row per (id, buyer, seller, product, count),
# i.e. the JSON sales_data column already expanded by hand
sales = [(1, "buyer1", "seller1", "product1", 12), (1, "buyer1", "seller1", "product2", 44),
         (2, "buyer2", "seller3", "product2", 12), (3, "buyer1", "seller3", "product3", 60),
         (3, "buyer1", "seller3", "product1", 42), (4, "buyer3", "seller2", "product2", 9),
         (4, "buyer3", "seller2", "product3", 2), (5, "buyer3", "seller1", "product2", 8)]
dfSales = sqlContext.createDataFrame(sales, ['id', 'buyer', 'seller', 'product', 'countVal'])
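
If you would rather build the same dataframes from the original files instead of hard-coding them, here is a minimal sketch. It assumes the files are tab-separated, that a Spark 2.1+ session named spark is available (from_json needs 2.1+), and it flattens the JSON sales_data column into one row per product; the file paths and separators are my assumptions, not part of the original question.

from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, IntegerType

# product_prices.csv: two columns, no header (assumed tab-separated)
dfProducts = (spark.read.csv('product_prices.csv', sep='\t')
              .toDF('product', 'price')
              .withColumn('price', F.col('price').cast('int')))

# Sales_information.csv: header row, sales_data holds a JSON object (assumed tab-separated)
dfRawSales = spark.read.csv('Sales_information.csv', sep='\t', header=True)

# Parse sales_data into a map and explode it into one (product, countVal) row per entry
salesSchema = MapType(StringType(), IntegerType())
dfSales = (dfRawSales
           .withColumn('sales_map', F.from_json('sales_data', salesSchema))
           .select('id', 'buyer', 'seller',
                   F.explode('sales_map').alias('product', 'countVal')))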

Total sales for each seller:

dfProducts.alias('p') \
    .join(dfSales.alias('s'), col('p.product') == col('s.product')) \
    .groupBy('s.seller') \
    .agg(F.sum(dfSales.countVal * dfProducts.price)) \
    .show()

Output (screenshot): Total sales for each seller
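
The screenshot is not reproduced here, but working the numbers out by hand from the sample data gives:

seller1    1160
seller2    240
seller3    2460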

Buyers list for each seller:

dfSales.groupBy("seller").agg(F.collect_set("buyer")).show()

Output (screenshot): Buyers list for each seller
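
Again, the screenshot is missing; from the sample data the sets should contain (collect_set does not guarantee element order):

seller1    [buyer1, buyer3]
seller2    [buyer3]
seller3    [buyer2, buyer1]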

You can save the results as csv using the df.write.csv('filename.csv') method.
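
A minimal sketch of writing the two files the question asks for, assuming the two aggregations above are first assigned to variables; the column aliases, coalesce(1), header and mode options are illustrative additions, and the buyer set is joined into a single string because the CSV writer cannot serialize array columns:

totalSales = (dfProducts.alias('p')
              .join(dfSales.alias('s'), col('p.product') == col('s.product'))
              .groupBy('s.seller')
              .agg(F.sum(dfSales.countVal * dfProducts.price).alias('total_sales')))

# CSV cannot hold an array column, so turn the buyer set into one comma-separated string
buyersList = (dfSales.groupBy('seller')
              .agg(F.concat_ws(', ', F.collect_set('buyer')).alias('buyers')))

totalSales.coalesce(1).write.csv('total_sellers_sales.csv', header=True, mode='overwrite')
buyersList.coalesce(1).write.csv('sellers_buyers_list.csv', header=True, mode='overwrite')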

Hope this helps.