Spark reducer and summing results issue

Posted: 2017-08-17 05:30:06

Tags: scala apache-spark

Here is the sample file:

Department,Designation,costToCompany,State

    Sales,Trainee,12000,UP
    Sales,Lead,32000,AP
    Sales,Lead,32000,LA
    Sales,Lead,32000,TN
    Sales,Lead,32000,AP
    Sales,Lead,32000,TN 
    Sales,Lead,32000,LA
    Sales,Lead,32000,LA
    Marketing,Associate,18000,TN
    Marketing,Associate,18000,TN
    HR,Manager,58000,TN

The above is in CSV format.

Generate the output:

  • Grouped by Department, Designation, State

  • With additional columns for sum(costToCompany) and TotalEmployeeCount

The result should be:

Dept,Desg,state,empCount,totalCost
Sales,Lead,AP,2,64000
Sales,Lead,LA,3,96000
Sales,Lead,TN,2,64000

Below is my solution, but writing the results to a file throws an error. What am I doing wrong here?

Step 1: Load the file

val file = sc.textFile("data/sales.txt")

Step 2: Create a case class to represent the data

scala> case class emp(Dept:String, Desg:String, totalCost:Double, State:String)
defined class emp

Step 3: Split the data and create an RDD of emp objects

scala> val fileSplit = file.map(_.split(","))
scala> val data = fileSplit.map(x => emp(x(0), x(1), x(2).toDouble, x(3)))

Step 4: Transform the data into key/value pairs, with key = (dept, desg, state) and value = (1, totalCost)

scala> val keyVals = data.map(x => ((x.Dept,x.Desg,x.State),(1,x.totalCost)))

Step 5: Aggregate with reduceByKey, since we want to sum both the employee count and the total cost

scala> val results = keyVals.reduceByKey{(a,b) => (a._1+b._1, a._2+b._2)} //(a.count+ b.count, a.cost+b.cost)
results: org.apache.spark.rdd.RDD[((String, String, String), (Int, Double))] = ShuffledRDD[41] at reduceByKey at <console>:55

Step 6: Save the results

scala> results.repartition(1).saveAsTextFile("data/result")

Error:

17/08/16 22:16:59 ERROR executor.Executor: Exception in task 0.0 in stage 20.0 (TID 23)
java.lang.NumberFormatException: For input string: "costToCompany"
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
    at java.lang.Double.parseDouble(Double.java:540)
    at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
    at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31)
    at $line85.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:51)
    at $line85.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:51)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:194)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
17/08/16 22:16:59 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 20.0 (TID 23, localhost, executor driver): java.lang.NumberFormatException: For input string: "costToCompany"

Update 1: I forgot to remove the header row. Updated code below. Save now throws a different error. Also, I need to put the header back into the output file.

scala> val file = sc.textFile("data/sales.txt")
scala> val fileSplit = file.map(_.split(","))
scala> val header = fileSplit.first()
scala> val noHeaderData = fileSplit.filter(_(0) != header(0))
scala> case class emp(Dept:String, Desg:String, totalCost:Double, State:String)
scala> val data = noHeaderData.map(x => emp(x(0), x(1), x(2).toDouble, x(3)))
scala> val keyVals = data.map(x => ((x.Dept,x.Desg,x.State),(1,x.totalCost)))
scala> val results = keyVals.reduceByKey{(a,b) => (a._1+b._1, a._2+b._2)}
scala> val resultSpecific = results.map(x => (x._1._1, x._1._2, x._1._3, x._2._1, x._2._2))
scala> resultSpecific.repartition(1).saveASTextFile("data/specific")
<console>:64: error: value saveASTextFile is not a member of org.apache.spark.rdd.RDD[(String, String, String, Int, Double)]
          resultSpecific.repartition(1).saveASTextFile("data/specific")

3 Answers:

Answer 0 (score: 4)

To answer your question and your comments:

In this case it is easier to work with dataframes, since your file is in CSV format. You can load and save the data as shown below; that way you don't have to split the lines yourself or handle the header manually (both when loading and when saving).

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val df = spark.read
        .format("com.databricks.spark.csv")
        .option("header", "true") // read the header row as column names
        .load("csv/file/path")

The dataframe's column names will be the same as the headers in the file. Instead of reduceByKey(), you can use the dataframe's groupBy() and agg():

import org.apache.spark.sql.functions.{count, sum}

val res = df.groupBy($"Department", $"Designation", $"State")
  .agg(count($"costToCompany").alias("empCount"), sum($"costToCompany").alias("totalCost"))

Then save it:

res.coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true") // writes the header line back into the output file
  .save("results.csv")
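
As a side note (not part of the original answer): on Spark 2.x the built-in CSV data source can replace the com.databricks.spark.csv package. A minimal sketch under that assumption, with placeholder output paths and the same file layout as in the question:

import org.apache.spark.sql.functions.{count, sum}

// Built-in CSV source (Spark 2.x); inferSchema reads costToCompany as a numeric type.
val df2 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/sales.txt")

df2.groupBy("Department", "Designation", "State")
  .agg(count("costToCompany").alias("empCount"), sum("costToCompany").alias("totalCost"))
  .coalesce(1)
  .write.option("header", "true")
  .csv("data/result_csv")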

Answer 1 (score: 2)

The string "costToCompany" cannot be cast to a double, which is why the job fails as soon as you trigger an action. Just remove the first record (the header line) from the data and it will work. You can also do this kind of operation with dataframes, which is much easier.
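
For example, a minimal sketch of dropping the header before parsing (it reuses the emp case class from the question; filtering on the full header line is just one way to do it):

val file = sc.textFile("data/sales.txt")
val header = file.first()            // the header line: "Department,Designation,costToCompany,State"
val rows = file.filter(_ != header)  // keep only the data lines
val data = rows.map(_.split(",")).map(x => emp(x(0), x(1), x(2).toDouble, x(3)))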

Answer 2 (score: 0)

The error is straightforward; it says:

<console>:64: error: value saveASTextFile is not a member of org.apache.spark.rdd.RDD[(String, String, String, Int, Double)]
       resultSpecific.repartition(1).saveASTextFile("data/specific")

In fact, there is no method called saveASTextFile(...); the method is saveAsTextFile(...). You have a typo (wrong letter case) in the method name.
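
In other words, the last line from the question should read:

resultSpecific.repartition(1).saveAsTextFile("data/specific")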