Spark DataFrame column selection use case

Date: 2017-08-10 07:17:52

Tags: scala apache-spark dataframe apache-spark-sql spark-dataframe

I want to achieve the following. I have Emp files (2 of them) and want to select only 2 columns, Empid and EmpName. If a file does not have the EmpName column, the resulting DataFrame should contain only the Empid column.

1) Emp1.csv (file)

Empid   EmpName Dept
1       ABC     IS
2       XYZ     COE

2) Emp.csv (file)

Empid   EmpName
1       ABC
2       XYZ

Code tried so far:

scala>  val SourceData = spark.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("delimiter", ",").option("header", "true").load("/root/Empfiles/")
SourceData: org.apache.spark.sql.DataFrame = [Empid: string, EmpName: string ... 1 more field]

scala> SourceData.printSchema
root
 |-- Empid: string (nullable = true)
 |-- EmpName: string (nullable = true)
 |-- Dept: string (nullable = true)

This code works if I specify all of the file's column names:
scala> var FormatedColumn = SourceData.select(
     |   SourceData.columns.map {
     |     case "Empid"   => SourceData("Empid").cast(IntegerType).as("empid")
     |     case "EmpName" => SourceData("EmpName").cast(StringType).as("empname")
     |     case "Dept"    => SourceData("Dept").cast(StringType).as("dept")
     |   }: _*
     | )
FormatedColumn: org.apache.spark.sql.DataFrame = [empid: int, empname: string ... 1 more field]

But it fails when I want only those specific 2 columns (select each column only if it is available, and change its data type and name):

scala> var FormatedColumn = SourceData.select(
     |   SourceData.columns.map {
     |     case "Empid"   => SourceData("Empid").cast(IntegerType).as("empid")
     |     case "EmpName" => SourceData("EmpName").cast(StringType).as("empname")
     |   }: _*
     | )
scala.MatchError: Dept (of class java.lang.String)
 at $anonfun$1.apply(<console>:32)
 at $anonfun$1.apply(<console>:32)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
 at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
  ... 53 elided
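The MatchError above is not Spark-specific: `map` applies the pattern to every column name, including `Dept`, which no case handles. A minimal plain-Scala sketch (no Spark required, using the column names from the example) of the failure mode, and of `collect`, which takes a partial function and silently skips non-matching elements instead of throwing:

```scala
val columns = Seq("Empid", "EmpName", "Dept")

// map with a non-exhaustive pattern throws scala.MatchError on "Dept"
val failed = util.Try(columns.map {
  case "Empid"   => "empid"
  case "EmpName" => "empname"
})
println(failed.isFailure) // true

// collect skips elements the partial function is not defined for
val selected = columns.collect {
  case "Empid"   => "empid"
  case "EmpName" => "empname"
}
println(selected) // List(empid, empname)
```

The same `collect`-for-`map` swap works on the `Array[Column]` passed to `select`, which directly gives the "take the column only if it exists" behavior asked for.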

3 Answers:

Answer 0 (score: 1):

All the other columns need a match as well:

var formattedColumn = sourceData.select(
  sourceData.columns.map {
      case "Empid"   => sourceData("Empid").cast(IntegerType).as("empid")
      case "EmpName" => sourceData("EmpName").cast(StringType).as("empname")
      case other: String => sourceData(other)
  }: _*
)

Update 1. If you only want to select the two columns "Empid" and "EmpName", there is no need for a matcher:

val formattedColumn = sourceData.select(
  sourceData("Empid").cast(IntegerType).as("empid"),
  sourceData("EmpName").cast(StringType).as("empname")
)

Update 2. If you want to select columns based on whether they are present, I can suggest the following:

val colEmpId = "Empid"
val colEmpName = "EmpName"
// list of possible expected column names
val selectableColums = Seq(colEmpId, colEmpName)
// take only the ones that are actually present in the DataFrame
val foundColumns = sourceData.columns.filter(column => selectableColums.contains(column))
// create the target dataframe; the backticks compare against the vals' values
// (a bare lowercase name in a pattern would bind a variable and match anything)
val formattedColumn = sourceData.select(
  foundColumns.map {
    case `colEmpId`   => sourceData(colEmpId).cast(IntegerType).as("empid")
    case `colEmpName` => sourceData(colEmpName).cast(StringType).as("empname")
    case other => throw new IllegalArgumentException("Unexpected column: " + other)
  }: _*
)
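One pitfall when matching on vals: a bare lowercase identifier in a `case` is a variable pattern that matches anything, so it silently shadows the outer val instead of comparing against it; to match the val's value it must be wrapped in backticks (a stable identifier pattern). A plain-Scala sketch of the difference, with hypothetical helper functions:

```scala
val colEmpId = "Empid"

// Variable pattern: colEmpId here binds the input, so this case matches ANY string
def looseMatch(name: String): Boolean = name match {
  case colEmpId => true // shadows the outer val; always taken
  case _        => false
}

// Stable identifier pattern: backticks compare against the existing val's value
def strictMatch(name: String): Boolean = name match {
  case `colEmpId` => true
  case _          => false
}

println(looseMatch("Dept"))   // true (unintended!)
println(strictMatch("Dept"))  // false
println(strictMatch("Empid")) // true
```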

P.S. Please use conventional camelCase names for your vals and vars.

Answer 1 (score: 0):

If you replace your statement with this query, it should work. It first narrows the DataFrame to the two columns, and then maps over the columns of that projected DataFrame rather than the original one (whose extra Dept column would still reach the matcher). This avoids the MatchError you are seeing.

val projected = df.select($"Empid", $"EmpName")
projected.select(projected.columns.map {
    case "Empid"   => projected("Empid").cast(IntegerType).as("empid")
    case "EmpName" => projected("EmpName").cast(StringType).as("empname")
}: _*)

Answer 2 (score: 0):

I am not sure why this needs to be so complicated.

Why not simply do this?

df
  .withColumn("empid", $"Empid".cast(IntegerType))
  .withColumn("empname", $"EmpName".cast(StringType))