I'm new to Apache Spark and Scala, and am currently learning this big data framework and programming language. For a sample file, given one field I'm trying to find the total of another field, its count, and a list of values from a third field. I tried it myself, but it seems I'm not writing the Spark RDD code in the best way (as a beginner).
Please find the sample data below (Customerid: Int, Orderid: Int, Amount: Float):
44,8602,37.19
35,5368,65.89
2,3391,40.64
47,6694,14.98
29,680,13.08
91,8900,24.59
70,3959,68.68
85,1733,28.53
53,9900,83.55
14,1505,4.32
51,3378,19.80
42,6926,57.77
2,4424,55.77
79,9291,33.17
50,3901,23.57
20,6633,6.49
15,6148,65.53
44,8331,99.19
5,3505,64.18
48,5539,32.42
My current code:
((sc.textFile("file://../customer-orders.csv")
    .map(x => x.split(","))
    .map(x => (x(0).toInt, x(1).toInt))
    .map { case (x, y) => (x, List(y)) }
    .reduceByKey(_ ++ _)
    .sortBy(_._1, true))
  .fullOuterJoin(sc.textFile("file://../customer-orders.csv")
    .map(x => x.split(","))
    .map(x => (x(0).toInt, x(2).toFloat))
    .reduceByKey((x, y) => x + y)
    .sortBy(_._1, true)))
  .fullOuterJoin(sc.textFile("file://../customer-orders.csv")
    .map(x => x.split(","))
    .map(x => x(0).toInt)
    .map(x => (x, 1))
    .reduceByKey((x, y) => x + y)
    .sortBy(_._1, true))
  .sortBy(_._1, true)
  .take(50)
  .foreach(println)
which gives results like this:
(49,(Some((Some(List(8558, 6986, 686....)),Some(4394.5996))),Some(96)))
The expected result looks like:
customerid, (orderids,..,..,....), totalamount, number of orderids
Is there a better way? I also tried combineByKey with the code below, but the println statements inside it are not printed.
scala> val reduced = inputrdd.combineByKey(
     |   (mark) => {
     |     println(s"Create combiner -> ${mark}")
     |     (mark, 1)
     |   },
     |   (acc: (Int, Int), v) => {
     |     println(s"""Merge value : (${acc._1} + ${v}, ${acc._2} + 1)""")
     |     (acc._1 + v, acc._2 + 1)
     |   },
     |   (acc1: (Int, Int), acc2: (Int, Int)) => {
     |     println(s"""Merge Combiner : (${acc1._1} + ${acc2._1}, ${acc1._2} + ${acc2._2})""")
     |     (acc1._1 + acc2._1, acc1._2 + acc2._2)
     |   }
     | )
reduced: org.apache.spark.rdd.RDD[(String, (Int, Int))] = ShuffledRDD[27] at combineByKey at <console>:29
scala> reduced.collect()
res5: Array[(String, (Int, Int))] = Array((maths,(110,2)), (physics,(214,3)), (english,(65,1)))
I'm using Spark version 2.2.0, Scala 2.11.8, and Java 1.8 build 101.
Answer 0 (score: 1)
This is easier to solve using the newer DataFrame API. First read the csv file and add column names:
val df = spark.read.option("inferSchema", "true").csv("file://../customer-orders.csv").toDF("Customerid", "Orderid", "Amount")
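Schema inference costs an extra pass over the file; if you would rather be explicit, here is a sketch that declares the schema up front (column names and types assumed to match the ones above):

import org.apache.spark.sql.types.{FloatType, IntegerType, StructField, StructType}

// Declaring the schema avoids the extra read that inferSchema performs.
val schema = StructType(Seq(
  StructField("Customerid", IntegerType),
  StructField("Orderid", IntegerType),
  StructField("Amount", FloatType)))

val df = spark.read.schema(schema).csv("file://../customer-orders.csv")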
Then aggregate using groupBy and agg (here with collect_list, sum, and count):
import org.apache.spark.sql.functions.{collect_list, count, sum}

val df2 = df.groupBy("Customerid").agg(
  collect_list($"Orderid") as "Orderids",
  sum($"Amount") as "TotalAmount",
  count($"Orderid") as "NumberOfOrderIds"
)
Resulting dataframe using the provided sample input:
+----------+------------+-----------+----------------+
|Customerid| Orderids|TotalAmount|NumberOfOrderIds|
+----------+------------+-----------+----------------+
| 51| [3378]| 19.8| 1|
| 15| [6148]| 65.53| 1|
| 29| [680]| 13.08| 1|
| 42| [6926]| 57.77| 1|
| 85| [1733]| 28.53| 1|
| 35| [5368]| 65.89| 1|
| 47| [6694]| 14.98| 1|
| 5| [3505]| 64.18| 1|
| 70| [3959]| 68.68| 1|
| 44|[8602, 8331]| 136.38| 2|
| 53| [9900]| 83.55| 1|
| 48| [5539]| 32.42| 1|
| 79| [9291]| 33.17| 1|
| 20| [6633]| 6.49| 1|
| 14| [1505]| 4.32| 1|
| 91| [8900]| 24.59| 1|
| 2|[3391, 4424]| 96.41| 2|
| 50| [3901]| 23.57| 1|
+----------+------------+-----------+----------------+
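The rows come back in shuffle-dependent order; if you want them sorted by customer, as the sortBy calls in the question's code intended, one way (assuming df2 from above) is:

// orderBy triggers a shuffle to sort the result by customer id.
df2.orderBy("Customerid").show(50)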
If you want to use the data as an RDD after these transformations, you can convert it afterwards:
val rdd = df2.as[(Int, Seq[Int], Double, Long)].rdd
(Note that sum produces a Double column and count produces a Long, so those are the element types to use here.)
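If you prefer named fields over a tuple, the same conversion can go through a case class; the name CustomerSummary below is my own, not from the original answer:

// Field names match the dataframe's column names so .as[...] can resolve them.
case class CustomerSummary(Customerid: Int, Orderids: Seq[Int],
                           TotalAmount: Double, NumberOfOrderIds: Long)

val summaryRdd = df2.as[CustomerSummary].rdd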
Of course, it is also possible to solve this directly with RDDs, using aggregateByKey:
val rdd = spark.sparkContext
  .textFile("test.csv")
  .map(x => x.split(","))
  .map(x => (x(0).toInt, (x(1).toInt, x(2).toFloat)))

// Zero value per key: (list of order ids, running amount total, running count)
val res = rdd.aggregateByKey((Seq[Int](), 0.0, 0))(
  (acc, xs) => (acc._1 ++ Seq(xs._1), acc._2 + xs._2, acc._3 + 1),
  (acc1, acc2) => (acc1._1 ++ acc2._1, acc1._2 + acc2._2, acc1._3 + acc2._3))
This is harder to read, but gives the same result as the dataframe approach above.
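Since the question specifically experimented with combineByKey, here is a sketch (mine, not from the original answer) of the same aggregation written with combineByKey over the same rdd of (Customerid, (Orderid, Amount)) pairs:

val res2 = rdd.combineByKey(
  // createCombiner: build an accumulator from the first value seen for a key
  (v: (Int, Float)) => (Seq(v._1), v._2.toDouble, 1),
  // mergeValue: fold one more (orderid, amount) pair into the accumulator
  (acc: (Seq[Int], Double, Int), v: (Int, Float)) =>
    (acc._1 ++ Seq(v._1), acc._2 + v._2, acc._3 + 1),
  // mergeCombiners: combine partial accumulators from different partitions
  (a: (Seq[Int], Double, Int), b: (Seq[Int], Double, Int)) =>
    (a._1 ++ b._1, a._2 + b._2, a._3 + b._3))

As an aside: println calls inside these functions run on the executors, so on a cluster their output ends up in the executor logs rather than the driver console, which is likely why the prints in the question's combineByKey attempt never showed up.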