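Drop the last column of a DataFrame using Spark Scala

Time: 2018-03-29 17:56:47

Tags: scala apache-spark

I have a DataFrame with no header names. I want to drop the last column, but without passing a column name. Is there a way to do this?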

df.drop("colname")

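Rather than passing a column name as above, how can I drop the last column from the DataFrame?

2 answers:

Answer 0 (score: 1)

Use the same API, resolving the name of the last column from df.schema: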

df.drop(df.schema.last.name)
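
One caveat with the one-liner above: `df.schema.last` throws when the DataFrame has no columns at all (for example `spark.emptyDataFrame`). A minimal sketch of a guarded variant follows, written script-style as in spark-shell. The helper name `dropLastColumn`, the local SparkSession settings, and the sample data are assumptions made for illustration and are not part of the Spark API or of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper (not part of Spark): drop the last column if there is one.
def dropLastColumn(df: DataFrame): DataFrame =
  df.schema.lastOption                  // None when the schema is empty
    .map(field => df.drop(field.name))  // same drop(colName) call as above
    .getOrElse(df)                      // no columns: return the input unchanged

// Assumed local session, only so the sketch runs outside spark-shell.
val spark = SparkSession.builder().master("local[1]").appName("drop-last-column").getOrCreate()
import spark.implicits._

// No names passed to toDF(), so Spark assigns the defaults _1, _2, _3.
val sample = Seq((1, "a", true), (2, "b", false)).toDF()
val trimmed = dropLastColumn(sample)
trimmed.printSchema()
```

Because the column name is read from the schema at runtime, this works whether the columns carry real headers or auto-generated names.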

Answer 1 (score: 1)

Another option in Scala:

df.drop(df.columns(df.columns.length - 1)) drops the last column

df.drop(df.columns(0)) drops the first column

Here is a complete example:


  import spark.implicits._ // needed for .toDS() outside spark-shell

  val mycsv =
    """
      ||TemperatureF|Date|timestamp|MinTemp|MaxTemp|
      ||        28.0| 01/01/2000 6:53 AM|946709580|   28.0|   37.4|
      ||        28.0| 01/01/2000 7:53 AM|946713180|   28.0|   37.4|
      ||        28.0| 01/01/2000 8:53 AM|946716780|   28.0|   37.4|
      ||        30.2|01/01/2000 10:24 PM|946765440|   30.2|   37.4|
      ||        30.9|01/01/2000 10:53 PM|946767180|   30.9|   37.4|
      ||        37.4| 01/02/2000 4:39 AM|946787940|   28.0|   37.4|
      ||        36.0| 01/02/2000 4:53 AM|946788780|   28.0|   36.0|
      ||        36.0| 01/02/2000 5:53 AM|946792380|   28.0|   36.0|
    """.stripMargin('|').linesIterator.toList.toDS() // linesIterator: on JDK 11+ .lines resolves to java.lang.String.lines
  val df = spark.read.option("header", true).option("sep", "|").option("inferSchema", true).csv(mycsv)
  println("original schema with first and last extra columns")
  df.printSchema

  val afterfirstAndLastDF =  df
    .drop(df.columns(df.columns.length - 1)) // drop last column
    .drop(df.columns(0)) // drop first column
  afterfirstAndLastDF.show()
  afterfirstAndLastDF.printSchema

Result:



original schema with first and last extra columns
root
 |-- _c0: string (nullable = true)
 |-- TemperatureF: double (nullable = true)
 |-- Date: string (nullable = true)
 |-- timestamp: integer (nullable = true)
 |-- MinTemp: double (nullable = true)
 |-- MaxTemp: double (nullable = true)
 |-- _c6: string (nullable = true)

+------------+-------------------+---------+-------+-------+
|TemperatureF|               Date|timestamp|MinTemp|MaxTemp|
+------------+-------------------+---------+-------+-------+
|        28.0| 01/01/2000 6:53 AM|946709580|   28.0|   37.4|
|        28.0| 01/01/2000 7:53 AM|946713180|   28.0|   37.4|
|        28.0| 01/01/2000 8:53 AM|946716780|   28.0|   37.4|
|        30.2|01/01/2000 10:24 PM|946765440|   30.2|   37.4|
|        30.9|01/01/2000 10:53 PM|946767180|   30.9|   37.4|
|        37.4| 01/02/2000 4:39 AM|946787940|   28.0|   37.4|
|        36.0| 01/02/2000 4:53 AM|946788780|   28.0|   36.0|
|        36.0| 01/02/2000 5:53 AM|946792380|   28.0|   36.0|
+------------+-------------------+---------+-------+-------+

root
 |-- TemperatureF: double (nullable = true)
 |-- Date: string (nullable = true)
 |-- timestamp: integer (nullable = true)
 |-- MinTemp: double (nullable = true)
 |-- MaxTemp: double (nullable = true)
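
The two chained drop calls in the example can also be collapsed into a single call, since `drop` accepts several column names at once. A sketch under stated assumptions: the helper name `dropFirstAndLast`, the local SparkSession settings, and the sample column names are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper (not part of Spark): drop the first and the last column in one call.
def dropFirstAndLast(df: DataFrame): DataFrame = {
  // take/takeRight return empty arrays on a DataFrame without columns, so nothing throws.
  val ends = df.columns.take(1) ++ df.columns.takeRight(1)
  df.drop(ends: _*) // the varargs overload drop(colNames: String*)
}

// Assumed local session, only so the sketch runs outside spark-shell.
val spark = SparkSession.builder().master("local[1]").appName("drop-first-and-last").getOrCreate()
import spark.implicits._

// Made-up row whose outer columns mimic the extra columns created by the leading and trailing '|'.
val padded = Seq(("", 28.0, 37.4, "")).toDF("_c0", "MinTemp", "MaxTemp", "_c3")
val inner = dropFirstAndLast(padded)
inner.printSchema()
```

Reading both names from the same `df.columns` array before dropping anything mirrors the chained version above, where the indices also refer to the original DataFrame rather than to the intermediate one.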