我是Scala和Spark的新手,有人可以在Scala代码下方进行优化,以查找每年学生获得的最高分数
val m=sc.textFile("marks.csv")
val SumOfMarks=m.map(_.split(",")).mapPartitionsWithIndex {(idx, iter) => if (idx == 0) iter.drop(1) else iter}.map(l=>((l(0),l(1)),l(3).toInt)).reduceByKey(_+_).sortBy(line => (line._1._1, line._2), ascending=false)
var s:Int=0
var y:String="0"
for(i<-SumOfMarks){ if((i._1._1!=y) || (i._2==s && i._1._1==y)){ println(i);s=i._2;y=i._1._1}}
Input : marks.csv
year,student,sub,marks
2016,ram,maths,90
2016,ram,physics,86
2016,ram,chemistry,88
2016,raj,maths,84
2016,raj,physics,96
2016,raj,chemistry,98
2017,raghu,maths,96
2017,raghu,physics,98
2017,raghu,chemistry,94
2017,rajesh,maths,92
2017,rajesh,physics,98
2017,rajesh,chemistry,98
输出:
2017,raghu,288
2017,rajesh,288
2016,raj,278
答案 0 :(得分:2)
我不确定“ Optimized”到底是什么意思,但是更“ scala-y”和“ spark-y”的实现方式可能如下:
import org.apache.spark.sql.expressions.Window
// Read your data file as a CSV file with row headers.
val marksDF = spark.read.option("header","true").csv("marks.csv")
// Calculate the total marks for each student in each year. The new total mark column will be called "totMark"
val marksByStudentYear = marksDF.groupBy(col("year"), col("student")).agg(sum(col("marks")).as("totMark"))
// Rank the marks within each year. Highest Mark will get rank 1, second highest rank 2 and so on.
// A benefit of rank is that if two scores have the same mark, they will both get the
// same rank.
val marksRankedByYear = marksByStudentYear.withColumn("rank", dense_rank().over(Window.partitionBy("year").orderBy($"totMark".desc)))
// Finally filter so that we only have the "top scores" (rank = 1) for each year,
// order by year and student name and display the result.
val topStudents = marksRankedByYear.filter($"rank" === 1).orderBy($"year", $"student").show
topStudents.show
这将在Spark-shell中产生以下输出:
+----+-------+-------+----+
|year|student|totMark|rank|
+----+-------+-------+----+
|2016| raj| 278.0| 1|
|2017| raghu| 288.0| 1|
|2017| rajesh| 288.0| 1|
+----+-------+-------+----+
如果您需要根据自己的问题显示CSV,则可以使用:
topStudents.collect.map(_.mkString(",")).foreach(println)
产生:
2016,raj,278.0,1
2017,raghu,288.0,1
2017,rajesh,288.0,1
我已将过程分为几个步骤。这样,您只需在中间结果上运行show,就可以查看每个步骤的情况。例如,要查看spark.read.option ...的功能,只需将marksDF.show输入到spark-shell
由于OP想要RDD版本,因此这里是一个示例。可能不是最佳选择,但确实给出了正确的结果:
import org.apache.spark.rdd.RDD
// A Helper function which makes it slightly easier to view RDD content.
def dump[R] (rdd : RDD[R]) = rdd.collect.foreach(println)
val marksRdd = sc.textFile("marks.csv")
// A case class to annotate the content in the RDD
case class Report(year:Int, student:String, sub:String, mark:Int)
// Create the RDD as a series of Report objects - ignore the header.
val marksReportRdd = marksRdd.map(_.split(",")).mapPartitionsWithIndex {
(idx, iter) => if (idx == 0) iter.drop(1) else iter
}.map(r => Report(r(0).toInt,r(1),r(2),r(3).toInt))
// Group the data by year and student.
val marksGrouped = marksReportRdd.groupBy(report => (report.year, report.student))
// Calculate the total score for each student for each year by adding up the scores
// of each subject the student has taken in that year.
val totalMarkStudentYear = marksGrouped.map{ case (key, marks:Iterable[Report]) => (key, marks.foldLeft(0)((acc, rep) => acc + rep.mark))}
// Determine the highest score for each year.
val yearScoreHighest = totalMarkStudentYear.map{ case (key, score:Int) => (key._1, score) }.reduceByKey(math.max(_, _))
// Determine the list of students who have received the highest score in each year.
// This is achieved by joining the total marks each student received in each year
// to the highest score in each year.
// The join is performed on the key which must is a Tuple2(year, score).
// To achieve this, both RDD's must be mapped to produce this key with a data attribute.
// The data attribute for the highest scores is a dummy value "x".
// The data attribute for the student scores is the student's name.
val highestRankStudentByYear = totalMarkStudentYear.map{ case (key, score) => ((key._1, score), key._2)}.join (yearScoreHighest.map (k => (k, "x")))
// Finally extract the year, student name and score from the joined RDD
// Sort by year and name.
val result = highestRankStudentByYear.map{ case (key, score) => (key._1, score._1, key._2)}.sortBy( r => (r._1, r._2))
// Show the final result.
dump(result)
val result = highestRankStudentByYear.map{ case (key, score) => (key._1, score._1, key._2)}.sortBy( r => (r._1, r._2))
dump(result)
以上结果为:
(2016,raj,278)
(2017,raghu,288)
(2017,rajesh,288)
像以前一样,您可以简单地通过使用转储功能转储中间的RDD来查看它们。注意:转储功能需要一个RDD。如果要显示DataFrame或数据集的内容,请使用它的show方法。
可能有比上述解决方案更好的解决方案,但确实可以做到。
希望RDD版本会鼓励您使用DataFrame和/或DataSet。代码不仅更简单,而且:
查看以下两个链接以获取更多信息:
https://data-flair.training/blogs/apache-spark-rdd-vs-dataframe-vs-dataset/ https://www.linkedin.com/pulse/apache-spark-rdd-vs-dataframe-dataset-chandan-prakash/ https://medium.com/@sachee/apache-spark-dataframe-vs-rdd-24a04d2eb1b9
或只是谷歌“我应该使用rdd或dataframe火花”
您的项目一切顺利。
答案 1 :(得分:0)
在SCALA火花壳上试用
scala> val df = spark.read.format("csv").option("header", "true").load("/CSV file location/marks.csv")
scala> df.registerTempTable("record")
scala> sql(" select year, student, marks from (select year, student, marks, RANK() over (partition by year order by marks desc) rank From ( Select year, student, SUM(marks) as marks from record group by Year, student)) where rank =1 ").show
它将生成下表
+----+-------+-----+
|year|student|marks|
+----+-------+-----+
|2016| raj|278.0|
|2017| raghu|288.0|
|2017| rajesh|288.0|
+----+-------+-----+
答案 2 :(得分:0)
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions
//Finding Max sum of marks each year
object Marks2 {
def getSparkContext() = {
val conf = new SparkConf().setAppName("MaxMarksEachYear").setMaster("local")
val sc = new SparkContext(conf)
sc
}
def dump[R](rdd: RDD[R]) = rdd.collect.foreach(println)
def main(args: Array[String]): Unit = {
// System.setProperty("hadoop.home.dir", "D:\\Setup\\hadoop_home")
val sc = getSparkContext()
val inpRDD = sc.textFile("marks.csv")
val head = inpRDD.first()
val marksRdd = inpRDD.filter(record=> !record.equals(head)).map(rec => rec.split(","))
val marksByNameyear = marksRdd.map(rec =>((rec(0).toInt,rec(1)),rec(3).toInt))
//marksByNameyear.cache()
val aggMarksByYearName = marksByNameyear.reduceByKey(_+_)
val maxMarksByYear = aggMarksByYearName.map(s=> (s._1._1,s._2))reduceByKey(math.max(_, _))
val markYearName = aggMarksByYearName.map(s => (s._2.toInt,s._1._2))
val marksAndYear = maxMarksByYear.map(s => (s._2.toInt,s._1))
val tt = sc.broadcast(marksAndYear.collect().toMap)
marksAndYear.flatMap {case(key,value) => tt.value.get(key).map {other => (other,value, key) } }
val yearMarksName = marksAndYear.leftOuterJoin(markYearName)
val result = yearMarksName.map(s =>(s._2._1,s._2._2,s._1)).sortBy(f=>f._3, true)
//dump(markYearName);'
dump(result)
}
}