I am trying to write a Spark job in Python. I have two CSV files with the following information:
File 1: product_prices.csv
product1 10
product2 20
product3 30
File 2: Sales_information.csv
id buyer transaction_date seller sales_data
1 buyer1 2015-1-01 seller1 {"product1":12,"product2":44}
2 buyer2 2015-1-01 seller3 {"product2":12}
3 buyer1 2015-1-01 seller3 {"product3":60,"product1":42}
4 buyer3 2015-1-01 seller2 {"product2":9,"product3":2}
5 buyer3 2015-1-01 seller1 {"product2":8}
Now, given the two data files above, I want the Spark job to compute two things and write the results to CSV files:
1) The total sales for each seller, written to total_sellers_sales.csv:
`seller_id total_sales`
`seller1 1160`
2) The list of buyers for each seller, written to sellers_buyers_list.csv, like this:
seller_id buyers
seller1 buyer1, buyer3
So could anyone tell me the correct way to write this Spark job?
Note: I need the code in Python.
Answer (score: 0):
Here is my PySpark code, run in Zeppelin 0.7.2. First I created the sample data frames manually:
from pyspark.sql.functions import col
from pyspark.sql import functions as F

# Product prices from product_prices.csv
products = [("product1", 10), ("product2", 20), ("product3", 30)]
dfProducts = sqlContext.createDataFrame(products, ['product', 'price'])
# Sales rows, with the JSON sales_data column flattened to one (product, countVal) pair per row
sales = [(1, "buyer1", "seller1", "product1", 12), (1, "buyer1", "seller1", "product2", 44),
         (2, "buyer2", "seller3", "product2", 12), (3, "buyer1", "seller3", "product3", 60),
         (3, "buyer1", "seller3", "product1", 42), (4, "buyer3", "seller2", "product2", 9),
         (4, "buyer3", "seller2", "product3", 2), (5, "buyer3", "seller1", "product2", 8)]
dfSales = sqlContext.createDataFrame(sales, ['id', 'buyer', 'seller', 'product', 'countVal'])
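In the real files the per-product counts live in a JSON column, so instead of hard-coding the rows you could load and flatten the files directly. A minimal sketch, assuming space-delimited files named as in the question, a spark session (sqlContext.read behaves similarly), and Spark 2.2+ where from_json can parse into a MapType:
from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import MapType, StringType, IntegerType

# product_prices.csv has no header: product name, then price
dfProducts = (spark.read.option("sep", " ").csv("product_prices.csv")
              .toDF("product", "price")
              .withColumn("price", col("price").cast("int")))

# Sales_information.csv has a header row; parse the JSON sales_data column
# into a map, then explode it into one (product, countVal) row per entry
dfRaw = spark.read.option("sep", " ").option("header", "true").csv("Sales_information.csv")
dfSales = (dfRaw
           .withColumn("m", from_json(col("sales_data"), MapType(StringType(), IntegerType())))
           .select("id", "buyer", "seller", explode("m").alias("product", "countVal")))
This yields dfProducts and dfSales in the same shape as the hand-built frames above, so the queries below work unchanged.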
Total sales for each seller:
dfProducts.alias('p') \
    .join(dfSales.alias('s'), col('p.product') == col('s.product')) \
    .groupBy('s.seller') \
    .agg(F.sum(dfSales.countVal * dfProducts.price)) \
    .show()
Output: [screenshot: total sales for each seller]
Buyers list for each seller:
dfSales.groupBy("seller").agg(F.collect_set("buyer")).show()
Output: [screenshot: buyers list for each seller]
You can save the results as CSV with the df.write.csv('filename.csv') method.
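Two caveats worth noting (my additions, not from the original answer): write.csv produces a directory of part files rather than a single file, and collect_set yields an array column that the CSV writer cannot serialize directly, so it should be joined into a string first, e.g. with concat_ws. A sketch for both outputs:
# 1) Total sales per seller -> total_sellers_sales.csv
(dfProducts.alias('p')
    .join(dfSales.alias('s'), col('p.product') == col('s.product'))
    .groupBy('s.seller')
    .agg(F.sum(dfSales.countVal * dfProducts.price).alias('total_sales'))
    .write.csv('total_sellers_sales.csv', header=True))

# 2) Buyers per seller -> sellers_buyers_list.csv; concat_ws turns the
#    array from collect_set into a "buyer1, buyer3" style string
(dfSales.groupBy('seller')
    .agg(F.concat_ws(', ', F.collect_set('buyer')).alias('buyers'))
    .write.csv('sellers_buyers_list.csv', header=True))
Add coalesce(1) before write if a single output file per result is required.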
Hope this helps.