Question

I am trying to use a Spark Dataset to load a fairly large amount of data, a subset of which looks like this.

+---+-------------+--------+---+
|age|maritalStatus|    name|sex|
+---+-------------+--------+---+
| 35|            M|  Joanna|  F|
| 25|            S|Isabelle|  F|
| 19|            S|    Andy|  M|
| 70|            M|  Robert|  M|
+---+-------------+--------+---+

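I need to perform relational transformations in which one column derives its value from other columns. For example, based on the "age" and "sex" of each person's record, I need to put Mr. or Mrs./Ms. in front of each person's "name" attribute. As another example, a person whose "age" is over 60 needs to be marked as a senior citizen (derived column "seniorCitizen" set to Y).

The transformed data I ultimately need looks like this: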

+---+-------------+------------+-------------+---+
|age|maritalStatus|        name|seniorCitizen|sex|
+---+-------------+------------+-------------+---+
| 35|            M| Mrs. Joanna|            N|  F|
| 25|            S|Ms. Isabelle|            N|  F|
| 19|            S|    Mr. Andy|            N|  M|
| 70|            M|  Mr. Robert|            Y|  M|
+---+-------------+------------+-------------+---+

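Most of the transformations Spark provides are static rather than dynamic, for example those defined in the examples here and here.

I am using Spark Datasets because I am loading from a relational data source, but if you can suggest a better way using plain RDDs, please do.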

Answer 1

You can use withColumn with a when clause to add the new seniorCitizen column, and to update name you can use a user-defined function (udf), as follows:

import spark.implicits._

import org.apache.spark.sql.functions._
// create some dummy data
val df = Seq((35, "M", "Joanna", "F"),
    (25, "S", "Isabelle", "F"),
    (19, "S", "Andy", "M"),
    (70, "M", "Robert", "M")
  ).toDF("age", "maritalStatus", "name", "sex")

// create a udf that prefixes name with a title derived from maritalStatus and sex
val append = udf((name: String, maritalStatus:String, sex: String) => {
  if (sex.equalsIgnoreCase("F") &&  maritalStatus.equalsIgnoreCase("M")) s"Mrs. ${name}"
  else if (sex.equalsIgnoreCase("F")) s"Ms. ${name}"
  else s"Mr. ${name}"
})

// overwrite name and add seniorCitizen using withColumn
// ("over 60" means 60 itself is not yet a senior citizen)
df.withColumn("name", append($"name", $"maritalStatus", $"sex"))
  .withColumn("seniorCitizen", when($"age" <= 60, "N").otherwise("Y")).show

Output:

+---+-------------+------------+---+-------------+
|age|maritalStatus|        name|sex|seniorCitizen|
+---+-------------+------------+---+-------------+
| 35|            M| Mrs. Joanna|  F|            N|
| 25|            S|Ms. Isabelle|  F|            N|
| 19|            S|    Mr. Andy|  M|            N|
| 70|            M|  Mr. Robert|  M|            Y|
+---+-------------+------------+---+-------------+
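One caveat with the udf above: it calls equalsIgnoreCase directly on sex and maritalStatus, so a record with a null in either column would throw a NullPointerException. Below is a minimal sketch of a null-tolerant version of the same logic, written as a plain Scala function so it can be checked without a SparkSession. The names Titles and titledName are hypothetical and not part of the original answer, and treating a missing sex as "Mr." is only an assumption that mirrors the original else branch.

```scala
object Titles {
  // Hypothetical helper, not part of the original answer.
  // Option(...) turns a null value into None, so incomplete records
  // do not raise a NullPointerException.
  def titledName(name: String, maritalStatus: String, sex: String): String = {
    if (name == null) {
      null // nothing to prefix; keep the missing name as null
    } else {
      val isFemale  = Option(sex).exists(_.equalsIgnoreCase("F"))
      val isMarried = Option(maritalStatus).exists(_.equalsIgnoreCase("M"))
      val title =
        if (isFemale && isMarried) "Mrs."
        else if (isFemale) "Ms."
        else "Mr." // assumption: same fallback as the original else branch
      s"$title $name"
    }
  }
}
```

With the imports from the answer in scope, this could presumably be wrapped as udf(Titles.titledName _) and used in place of append.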

Edit

Here is the same transformation without using a UDF:

df.withColumn("name",
    when($"sex" === "F",
      when($"maritalStatus" === "M", concat(lit("Mrs. "), df("name")))
        .otherwise(concat(lit("Ms. "), df("name"))))
    .otherwise(concat(lit("Mr. "), df("name"))))
  .withColumn("seniorCitizen", when($"age" <= 60, "N").otherwise("Y"))

Hope this helps!

Answer 2

Spark functions can help you get the job done. You can combine the when, concat, and lit functions as shown below:

val updateName = when(lower($"maritalStatus") === "m" && lower($"sex") === "f", concat(lit("Mrs. "), $"name"))
                      .otherwise(when(lower($"maritalStatus") === "s" && lower($"sex") === "f", concat(lit("Ms. "), $"name"))
                      .otherwise(when(lower($"sex") === "m", concat(lit("Mr. "), $"name"))))

val updatedDataSet = dataset.withColumn("name", updateName)
  .withColumn("seniorCitizen", when($"age" > 60, "Y").otherwise("N"))

updatedDataSet is the dataset you need.
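The nested otherwise(when(...)) calls above can also be written as a flat chain, since the Column returned by when has its own when method. The following is only a sketch of that variant: FlatTitles, withTitle, and transform are hypothetical names introduced here, and col(...) is used instead of $"..." so that no implicits import is required. As in the nested version, a row that matches none of the branches gets a null name.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, concat, lit, lower, when}

object FlatTitles {
  // Hypothetical helper: prefixes the existing name column with a title.
  private def withTitle(title: String): Column =
    concat(lit(title + " "), col("name"))

  // Branches are evaluated top to bottom; the first match wins.
  // There is no final otherwise, so unmatched rows yield null,
  // which is the same behavior as the nested version.
  def updateName: Column = {
    val sex     = lower(col("sex"))
    val marital = lower(col("maritalStatus"))
    when(marital === "m" && sex === "f", withTitle("Mrs."))
      .when(marital === "s" && sex === "f", withTitle("Ms."))
      .when(sex === "m", withTitle("Mr."))
  }

  // Applies both derived columns to a DataFrame with the columns
  // age, maritalStatus, name, and sex.
  def transform(dataset: DataFrame): DataFrame =
    dataset
      .withColumn("name", updateName)
      .withColumn("seniorCitizen", when(col("age") > 60, "Y").otherwise("N"))
}
```

Calling FlatTitles.transform(dataset) should then give the same result as updatedDataSet, with each new rule costing one extra .when line rather than another level of nesting.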

Relational transformations in Spark

2 answers: