I have a dataset like the one shown below:
hduser@ubuntu:~$ hadoop fs -cat /user/hduser/test_sample/sample1.txt
Eid1,EName1,EDept1,100
Eid2,EName2,EDept1,102
Eid3,EName3,EDept1,101
Eid4,EName4,EDept2,110
Eid5,EName5,EDept2,121
Eid6,EName6,EDept3,99
I want to use Spark code to generate the following output:
Eid1,EName1,IT,102,1
Eid2,EName2,IT,101,2
Eid3,EName3,IT,100,3
Eid4,EName4,ComSc,121,1
Eid5,EName5,ComSc,110,2
Eid6,EName6,Mech,99,1
which is equivalent to the following SQL:
select emp_id, emp_name,
       case when emp_dept = 'EDept1' then 'IT'
            when emp_dept = 'EDept2' then 'ComSc'
            when emp_dept = 'EDept3' then 'Mech'
       end as dept_name,
       emp_sal,
       row_number() over (partition by emp_dept order by emp_sal desc) as rn
from emp
Can someone suggest how I should approach this?
Answer (score: 1)
You can use RDD.zipWithIndex, convert the result to a DataFrame, and then use min() and a join to get the result you want.
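As a quick aside (my illustration, not part of the original answer): zipWithIndex pairs every element with its position in the RDD's current ordering, which is what makes the sort-then-index trick work.

// zipWithIndex assigns consecutive Long indices following the RDD's order
sc.parallelize(Seq("a", "b", "c")).zipWithIndex.collect()
// => Array((a,0), (b,1), (c,2))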
The full solution looks like this:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.min
import sqlContext.implicits._  // for $-columns and toDF

// SORT BY added as per comment request: order by dept, then salary within dept,
// so zipWithIndex hands out contiguous indices inside each department
val test = sc.textFile("/user/hadoop/test.txt")
  .sortBy(line => (line.split(",")(2), line.split(",")(3).toInt))
// Table to hold the dept name lookups
val deptDF = sc
  .parallelize(Array(("EDept1", "IT"), ("EDept2", "ComSc"), ("EDept3", "Mech")))
  .toDF("deptCode", "dept")
val schema = StructType(Array(
  StructField("col1", StringType, false),  // emp_id
  StructField("col2", StringType, false),  // emp_name
  StructField("col3", StringType, false),  // emp_dept code
  StructField("col4", StringType, false),  // emp_sal (kept as String from the CSV split)
  StructField("col5", LongType, false)     // zipWithIndex position
))
// join to deptDF added as per comment
val testDF = sqlContext.createDataFrame(
    test.zipWithIndex.map { case (line, idx) => Row.fromSeq(line.split(",") ++ Array(idx)) },
    schema
  )
  .join(deptDF, $"col3" === $"deptCode")  // translate the dept code into its name
  .select($"col1", $"col2", $"dept" as "col3", $"col4", $"col5")
  .orderBy($"col5")
testDF.show
col1  col2    col3   col4  col5
Eid1  EName1  IT     100   0
Eid3  EName3  IT     101   1
Eid2  EName2  IT     102   2
Eid4  EName4  ComSc  110   3
Eid5  EName5  ComSc  121   4
Eid6  EName6  Mech   99    5
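Note that zipWithIndex numbers the rows globally, starting at 0. The join below turns that global position into a per-department rank starting at 1, by subtracting each department's minimum index and adding 1.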
// For each dept, find its smallest global index ("start"), then rank = col5 - start + 1
val result = testDF.join(
    testDF.groupBy($"col3").agg($"col3" as "g_col3", min($"col5") as "start"),
    $"col3" === $"g_col3"
  )
  .select($"col1", $"col2", $"col3", $"col4", $"col5" - $"start" + 1 as "index")
result.show
col1  col2    col3   col4  index
Eid4  EName4  ComSc  110   1
Eid5  EName5  ComSc  121   2
Eid6  EName6  Mech   99    1
Eid1  EName1  IT     100   1
Eid3  EName3  IT     101   2
Eid2  EName2  IT     102   3
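Side note: the result above ranks salaries in ascending order, while the SQL in the question uses emp_sal desc. If you are on Spark 1.6+, the window-function API expresses that SQL directly; a minimal sketch, assuming the testDF built above and casting col4 to int since the schema stores the salary as a String:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// row_number() over (partition by emp_dept order by emp_sal desc)
val w = Window.partitionBy($"col3").orderBy($"col4".cast("int").desc)
testDF
  .select($"col1", $"col2", $"col3", $"col4", row_number().over(w) as "rn")
  .show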