Using Spark Scala

Asked: 2017-07-08 14:40:18

Tags: scala apache-spark join

How do I compute the average salary per location in Spark Scala using the following two datasets?

File1.csv (the 4th column is the salary)

Ram, 30, Engineer, 40000  
Bala, 27, Doctor, 30000  
Hari, 33, Engineer, 50000  
Siva, 35, Doctor, 60000

File2.csv (the 2nd column is the location)

Hari, Bangalore  
Ram, Chennai  
Bala, Bangalore  
Siva, Chennai  

The files above are not sorted. I need to join these two files and find the average salary for each location. I tried the code below but could not get it to work.

val salary = sc.textFile("File1.csv").map(e => e.split(","))  
val location = sc.textFile("File2.csv").map(e.split(","))  
val joined = salary.map(e=>(e(0),e(3))).join(location.map(e=>(e(0),e(1)))  
val joinedData = joined.sortByKey()  
val finalData = joinedData.map(v => (v._1,v._2._1._1,v._2._2))  
val aggregatedDF = finalData.map(e=> e.groupby(e(2)).agg(avg(e(1))))    
aggregatedDF.repartition(1).saveAsTextFile("output.txt")  

Please help with the code and show a sample of the expected output.

Many thanks.

4 answers:

Answer 0 (score: 2)

I would use the DataFrame API; this should work:

import spark.implicits._   // needed for toDF on an RDD of tuples

val salary = sc.textFile("File1.csv")
               .map(e => e.split(",").map(_.trim))   // trim the spaces after the commas
               .map{case Array(name,_,_,salary) => (name,salary)}
               .toDF("name","salary")

val location = sc.textFile("File2.csv")
                 .map(e => e.split(",").map(_.trim))
                 .map{case Array(name,location) => (name,location)}
                 .toDF("name","location")

import org.apache.spark.sql.functions._

salary
  .join(location,Seq("name"))
  .groupBy($"location")
  .agg(
    avg($"salary").as("avg_salary")
  )
  .repartition(1)
  .write.csv("output.csv")
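
Note that write.csv("output.csv") creates a directory named output.csv containing Spark part files, not a single plain file; repartition(1) only ensures there is a single part file inside that directory.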

Answer 1 (score: 2)

You can read the CSV files as DataFrames, then join and group them to get the averages:

val df1 = spark.read.csv("/path/to/file1.csv").toDF(
  "name", "age", "title", "salary"
)

val df2 = spark.read.csv("/path/to/file2.csv").toDF(
  "name", "location"
)

import org.apache.spark.sql.functions._

val dfAverage = df1.join(df2, Seq("name")).
  groupBy(df2("location")).agg(avg(df1("salary")).as("average")).
  select("location", "average")

dfAverage.show
+---------+-------+
| location|average|
+---------+-------+
|Bangalore|40000.0|
|  Chennai|50000.0|
+---------+-------+
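
Since no schema is supplied, spark.read.csv reads every column as a string; avg implicitly casts salary to double, which is why the averages print as 40000.0 and 50000.0. If the files really contain a space after each comma, as in the samples above, consider adding option("ignoreLeadingWhiteSpace", "true") when reading so that the location values group cleanly.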

Answer 2 (score: 1)

I would use DataFrames. First read the data, for example:

val salary = spark.read.option("header", "true").csv("File1.csv")
val location = spark.read.option("header", "true").csv("File2.csv")

If your files don't have a header, you need to set the header option to "false" and rename the default column names, e.g. with toDF (as below) or withColumnRenamed. Since the sample files above have no header row, this is the variant that applies here:

val salary = spark.read.option("header", "false").csv("File1.csv").toDF("name", "age", "job", "salary")
val location = spark.read.option("header", "false").csv("File2.csv").toDF("name", "location")

Now do the join:

val joined = salary.join(location, "name")

Finally, compute the average (avg comes from org.apache.spark.sql.functions._, and the $"salary" syntax needs import spark.implicits._):

val avgSalary = joined.groupBy("location").agg(avg($"salary"))

And save it:

avgSalary.repartition(1).write.csv("output.csv")
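
With the sample data above, the part file written under output.csv should contain something like (row order may vary):

Bangalore,40000.0
Chennai,50000.0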

Answer 3 (score: 0)

You can do it like this, with plain RDDs.
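
A minimal sketch that builds such an averages pair RDD from the two files above (assuming a running SparkContext sc; the names salaryByName and locationByName are illustrative, and the average uses integer division, matching the Int results below):

// key both files by name
val salaryByName = sc.textFile("File1.csv")
  .map(_.split(",").map(_.trim))
  .map(a => (a(0), a(3).toInt))      // (name, salary)

val locationByName = sc.textFile("File2.csv")
  .map(_.split(",").map(_.trim))
  .map(a => (a(0), a(1)))            // (name, location)

// join on name, re-key by location, then average via a (sum, count) pair
val averages = salaryByName.join(locationByName)
  .map { case (_, (salary, location)) => (location, (salary, 1)) }
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .map { case (location, (sum, count)) => (location, sum / count) }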

averages.take(10)

would then give:

res5: Array[(String, Int)] = Array((Chennai,50000), (Bangalore,40000))
