Question

I want to drop multiple columns from a DataFrame in a single call, rather than chaining .drop("col1").drop("col2").

Note: I am using spark-1.6.0.

Answer 1

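This is supported natively in current Spark versions (2.0 and later), where drop accepts several column names at once, for example df.drop("col1", "col2"). For earlier versions, the following code can be used.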


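    import org.apache.spark.sql.DataFrame
    import scala.annotation.tailrec

    implicit class DataFrameOperation(df: DataFrame) {
      // Drops each named column in turn and returns the resulting DataFrame.
      def dropCols(cols: String*): DataFrame = {
        @tailrec def deleteCol(df: DataFrame, cols: Seq[String]): DataFrame =
          if (cols.isEmpty) df else deleteCol(df.drop(cols.head), cols.tail)
        deleteCol(df, cols)
      }
    }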

Usage:

val finalDF = dataFrame.dropCols("col1","col2","col3")

Answer 2

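This approach is a workaround: instead of dropping columns, it selects every column that is not in the drop list.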

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import org.apache.spark.sql.DataFrame;
import scala.collection.JavaConversions;
public static DataFrame drop(DataFrame dataFrame, List<String> dropCol) {
    // colname holds the names of all columns except the ones to be dropped.
    List<String> colname = Arrays.stream(dataFrame.columns()).filter(col -> !dropCol.contains(col)).collect(Collectors.toList());
    return dataFrame.selectExpr(JavaConversions.asScalaBuffer(colname));
}

inputDataFrame:

+---+---+---+---+---+
| C0| C1| C2| C3| C4|
+---+---+---+---+---+
|  0|  0|  0|  0|  1|
|  1|  5|  6|  0| 14|
|  1|  6|  1|  0|  3|
|  1|  0|  1|  0|  1|
|  1| 37|  9|  0| 19|
+---+---+---+---+---+

After dropping columns C0, C2, and C4:

colDroppedDataFrame:

+---+---+
| C1| C3|
+---+---+
|  0|  0|
|  5|  0|
|  6|  0|
|  0|  0|
| 37|  0|
+---+---+
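
The filtering step in Answer 2 can be checked without Spark. The following is a minimal sketch in plain Java, assuming the column names from the example table above. The class name KeepColumns and the method columnsToKeep are hypothetical and exist only for this illustration; the sketch computes the list of names that the workaround would pass to selectExpr.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class KeepColumns {
    // Hypothetical helper: returns the column names that survive the drop,
    // preserving their original order.
    static List<String> columnsToKeep(String[] allColumns, List<String> dropCol) {
        return Arrays.stream(allColumns)
                .filter(col -> !dropCol.contains(col))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Column names taken from the example inputDataFrame.
        String[] columns = {"C0", "C1", "C2", "C3", "C4"};
        List<String> kept = columnsToKeep(columns, Arrays.asList("C0", "C2", "C4"));
        System.out.println(kept); // prints [C1, C3]
    }
}
```

Names in the drop list that do not match any column are simply ignored, because the filter only removes names that are actually present.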

Dropping multiple columns from a DataFrame recursively in Spark <= 1.6.0
