Finding the highest marks for each year

Date: 2019-05-07 05:15:18

Tags: scala apache-spark

I am new to Scala and Spark. Could someone optimize the Scala code below, which finds the highest total marks scored by a student in each year?

val m = sc.textFile("marks.csv")
val SumOfMarks = m.map(_.split(","))
  .mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
  .map(l => ((l(0), l(1)), l(3).toInt))
  .reduceByKey(_ + _)
  .sortBy(line => (line._1._1, line._2), ascending = false)
var s: Int = 0
var y: String = "0"
for (i <- SumOfMarks) { if ((i._1._1 != y) || (i._2 == s && i._1._1 == y)) { println(i); s = i._2; y = i._1._1 } }


Input : marks.csv
year,student,sub,marks
2016,ram,maths,90
2016,ram,physics,86
2016,ram,chemistry,88
2016,raj,maths,84
2016,raj,physics,96
2016,raj,chemistry,98
2017,raghu,maths,96
2017,raghu,physics,98
2017,raghu,chemistry,94
2017,rajesh,maths,92
2017,rajesh,physics,98
2017,rajesh,chemistry,98

Output:

2017,raghu,288
2017,rajesh,288
2016,raj,278

3 Answers:

Answer 0 (score: 2)

I am not sure exactly what you mean by "optimized", but a more "scala-y" and "spark-y" way of implementing it might be as follows:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._  // col, sum, dense_rank etc. (needed when not running in spark-shell)
import spark.implicits._                 // the $"col" syntax (needed when not running in spark-shell)

// Read your data file as a CSV file with row headers.
val marksDF = spark.read.option("header","true").csv("marks.csv")

// Calculate the total marks for each student in each year. The new total mark column will be called "totMark"
val marksByStudentYear = marksDF.groupBy(col("year"), col("student")).agg(sum(col("marks")).as("totMark"))

// Rank the marks within each year. Highest Mark will get rank 1, second highest rank 2 and so on.

// A benefit of rank is that if two scores have the same mark, they will both get the
// same rank.
val marksRankedByYear = marksByStudentYear.withColumn("rank", dense_rank().over(Window.partitionBy("year").orderBy($"totMark".desc)))

// Finally filter so that we only have the "top scores" (rank = 1) for each year,
// order by year and student name and display the result.
val topStudents = marksRankedByYear.filter($"rank" === 1).orderBy($"year", $"student")

topStudents.show

This produces the following output in spark-shell:

+----+-------+-------+----+
|year|student|totMark|rank|
+----+-------+-------+----+
|2016|    raj|  278.0|   1|
|2017|  raghu|  288.0|   1|
|2017| rajesh|  288.0|   1|
+----+-------+-------+----+

If you need the output displayed as CSV, as in your question, you can use:

topStudents.collect.map(_.mkString(",")).foreach(println)

which produces:

2016,raj,278.0,1
2017,raghu,288.0,1
2017,rajesh,288.0,1
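
If you would rather write the result out as CSV files than print it, a DataFrameWriter call along these lines would do it (just a sketch; the output path is an arbitrary example):

// Drop the helper "rank" column and write CSV part files (with a header row) under the given directory.
topStudents.drop("rank").write.option("header", "true").csv("top_students_csv")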

I have broken the process down into several steps so that you can see what each step does simply by running show on the intermediate result. For example, to see what spark.read.option... does, just type marksDF.show into spark-shell.
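
For instance, a quick way to look at each intermediate step in spark-shell (assuming the definitions above have already been entered) is:

marksDF.show()             // the raw rows as read from marks.csv
marksByStudentYear.show()  // the summed "totMark" per (year, student)
marksRankedByYear.show()   // the totals with the per-year "rank" column added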

Since the OP wanted an RDD version, here is an example. It is probably not optimal, but it does give the correct result:

import org.apache.spark.rdd.RDD

// A Helper function which makes it slightly easier to view RDD content.
def dump[R] (rdd : RDD[R]) = rdd.collect.foreach(println)

val marksRdd = sc.textFile("marks.csv")
// A case class to annotate the content in the RDD
case class Report(year:Int, student:String, sub:String, mark:Int)

// Create the RDD as a series of Report objects - ignore the header.
val marksReportRdd = marksRdd.map(_.split(",")).mapPartitionsWithIndex {
    (idx, iter) => if (idx == 0) iter.drop(1) else iter
  }.map(r => Report(r(0).toInt,r(1),r(2),r(3).toInt))

// Group the data by year and student.
val marksGrouped = marksReportRdd.groupBy(report => (report.year, report.student))

// Calculate the total score for each student for each year by adding up the scores
// of each subject the student has taken in that year.
val totalMarkStudentYear = marksGrouped.map{ case (key, marks:Iterable[Report]) => (key, marks.foldLeft(0)((acc, rep) => acc + rep.mark))}

// Determine the highest score for each year.
val yearScoreHighest = totalMarkStudentYear.map{ case (key, score:Int) => (key._1, score) }.reduceByKey(math.max(_, _))

// Determine the list of students who have received the highest score in each year.
// This is achieved by joining the total marks each student received in each year
// to the highest score in each year.
// The join is performed on the key, which is a Tuple2(year, score).
// To achieve this, both RDD's must be mapped to produce this key with a data attribute.
// The data attribute for the highest scores is a dummy value "x".
// The data attribute for the student scores is the student's name.
val highestRankStudentByYear = totalMarkStudentYear.map{ case (key, score) => ((key._1, score), key._2)}.join (yearScoreHighest.map (k => (k, "x")))

// Finally extract the year, student name and score from the joined RDD
// Sort by year and name.
val result = highestRankStudentByYear.map{ case (key, score) => (key._1, score._1, key._2)}.sortBy( r => (r._1, r._2))

// Show the final result.
dump(result)


The above produces:

(2016,raj,278)
(2017,raghu,288)
(2017,rajesh,288)

As before, you can inspect the intermediate RDDs simply by dumping them with the dump function. Note: dump expects an RDD. If you want to display the content of a DataFrame or Dataset, use its show method.
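
For example, assuming the definitions above have been run:

dump(totalMarkStudentYear)   // prints tuples such as ((2016,raj),278)
marksByStudentYear.show()    // a DataFrame or Dataset is displayed with show instead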

There may well be better solutions than the one above, but it does the job.

Hopefully the RDD version will encourage you to use DataFrames and/or DataSets instead. Not only is the code simpler, but:

  • Spark evaluates DataFrame and DataSet transformations as a whole and can optimize the entire pipeline. That is not the case with RDDs (they are executed one after another without such optimization), so the DataFrame/DataSet-based version is likely to run faster (assuming you have not manually optimized the equivalent RDD code).
  • DataSets and DataFrames allow schemas to varying degrees (e.g. named columns and data typing).
  • DataFrames and DataSets can be queried using SQL.
  • DataFrame and DataSet operations/methods are more in line with SQL constructs.
  • DataFrames and DataSets are easier to use than RDDs.
  • DataSets (and RDDs) offer compile-time error detection (see the sketch after this list).
  • DataSets are the future direction.
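
As a small illustration of the schema and compile-time checking points, here is a sketch that reuses the Report case class and the marksDF DataFrame defined earlier in this answer (the column casts and the rename of "marks" to "mark" are only there to line the DataFrame up with the case class):

import spark.implicits._

// Cast the string columns produced by the CSV reader and rename "marks" so the
// columns match the Report case class, then convert to a strongly typed Dataset.
val marksDS = marksDF
  .select($"year".cast("int"), $"student", $"sub", $"marks".cast("int").as("mark"))
  .as[Report]

// Field access is now checked at compile time, e.g. totalling marks per (year, student):
val totalsDS = marksDS
  .groupByKey(r => (r.year, r.student))
  .mapValues(_.mark)
  .reduceGroups(_ + _)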

Check out the following links for more information:

https://data-flair.training/blogs/apache-spark-rdd-vs-dataframe-vs-dataset/
https://www.linkedin.com/pulse/apache-spark-rdd-vs-dataframe-dataset-chandan-prakash/
https://medium.com/@sachee/apache-spark-dataframe-vs-rdd-24a04d2eb1b9

Or just google "should I use rdd or dataframe spark".

All the best with your project.

Answer 1 (score: 0)

Try it in the Scala spark-shell:

scala> val df = spark.read.format("csv").option("header", "true").load("/CSV file location/marks.csv")
scala> df.registerTempTable("record")
scala> sql(" select year, student, marks from (select year, student, marks, RANK() over (partition by year order by marks desc) rank From ( Select year, student, SUM(marks) as marks from record group by Year, student)) where rank =1 ").show

It will produce the following table:

+----+-------+-----+
|year|student|marks|
+----+-------+-----+
|2016|    raj|278.0|
|2017|  raghu|288.0|
|2017| rajesh|288.0|
+----+-------+-----+
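
A side note: registerTempTable has been deprecated since Spark 2.0. The current equivalent is createOrReplaceTempView, after which the same sql(...) query above runs unchanged:

scala> df.createOrReplaceTempView("record")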

Answer 2 (score: 0)

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions
//Finding Max sum of marks each year
object Marks2 {
  def getSparkContext() = {
    val conf = new SparkConf().setAppName("MaxMarksEachYear").setMaster("local")
    val sc = new SparkContext(conf)
    sc
  }

  def dump[R](rdd: RDD[R]) = rdd.collect.foreach(println)

  def main(args: Array[String]): Unit = {
   // System.setProperty("hadoop.home.dir", "D:\\Setup\\hadoop_home")
    val sc = getSparkContext()

    val inpRDD = sc.textFile("marks.csv")
    val head = inpRDD.first()
    val marksRdd = inpRDD.filter(record=> !record.equals(head)).map(rec => rec.split(","))
    val marksByNameyear = marksRdd.map(rec =>((rec(0).toInt,rec(1)),rec(3).toInt))
    
    //marksByNameyear.cache()

    val aggMarksByYearName = marksByNameyear.reduceByKey(_+_)
    val maxMarksByYear = aggMarksByYearName.map(s => (s._1._1, s._2)).reduceByKey(math.max(_, _))
    
    
    val markYearName = aggMarksByYearName.map(s => (s._2.toInt,s._1._2))
    val marksAndYear = maxMarksByYear.map(s => (s._2.toInt,s._1))
    
    // Note: this broadcast-based lookup is left over and not used below; the leftOuterJoin
    // that follows produces the final result instead.
    // val tt = sc.broadcast(marksAndYear.collect().toMap)
    // marksAndYear.flatMap { case (key, value) => tt.value.get(key).map { other => (other, value, key) } }
    val yearMarksName = marksAndYear.leftOuterJoin(markYearName) 
    
    val result = yearMarksName.map(s =>(s._2._1,s._2._2,s._1)).sortBy(f=>f._3, true)
   
   //dump(markYearName)
   dump(result)

  }
}