Question

I have a very wide DataFrame in Spark. It has 80 columns, and I want to set one column to 0 and the rest to 1.

To set that one column to 0, I tried:

df = df.withColumn("set_zero_column", lit(0))

That worked.

Now I want to set the remaining columns to 1. How can I do that without specifying all 79 names?

Thanks for your help.

Answer 1

Use select with a list comprehension:

from pyspark.sql.functions import lit

set_one_columns = [lit(1).alias(c) for c in df.columns if c != "set_zero_column"]
df = df.select(lit(0).alias("set_zero_column"), *set_one_columns)

If you need to keep the original column order, you can do the following instead:

cols = [lit(0).alias(c) if c == "set_zero_column" else lit(1).alias(c) for c in df.columns]
df = df.select(*cols)
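The order-preserving variant above can also be wrapped in a small reusable helper. This is only a minimal sketch: constant_plan and apply_plan are hypothetical names introduced here, not part of PySpark, and the Spark import is kept inside apply_plan so that the planning step can be checked without a Spark installation.

```python
def constant_plan(columns, zero_column, zero_value=0, other_value=1):
    """Pair every column name with the constant it should hold, in the original order."""
    return [(c, zero_value if c == zero_column else other_value) for c in columns]


def apply_plan(df, plan):
    """Replace every column of df with its planned constant via a single select."""
    # Imported here so constant_plan can be checked without a Spark installation.
    from pyspark.sql.functions import lit
    return df.select(*[lit(value).alias(name) for name, value in plan])


# Usage with a Spark DataFrame named df (assumed to exist):
# df = apply_plan(df, constant_plan(df.columns, "set_zero_column"))
plan = constant_plan(["a", "set_zero_column", "b"], "set_zero_column")
print(plan)  # [('a', 1), ('set_zero_column', 0), ('b', 1)]
```

Separating the plan from its application keeps the column logic easy to inspect before anything is run against the 80-column DataFrame.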

Answer 2

Here is my attempt at an answer in Scala:

Example:

Method1:

import org.apache.spark.sql.functions.lit
//sample dataframe (toDF needs spark.implicits._, which spark-shell imports automatically)
val df=Seq(("a",1)).toDF("id","id1")
//pair each column name with the literal it should hold
val cls=df.columns.map(x => if (x != "id1") (x,lit("1")) else (x,lit("0")))
//use foldLeft to overwrite the columns one by one
val df2=cls.foldLeft(df){(df,cls) => df.withColumn(cls._1,cls._2)}

Result:

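df2.show()
+---+---+
| id|id1|
+---+---+
|  1|  0|
+---+---+

Method2 (pault's approach):

//build one aliased literal column per original column
val cls=df.columns.map( x => if (x !="id1") lit(1).alias(s"${x}") else lit(0).alias(s"${x}"))
//select the literal columns to apply them
df.select(cls: _*).show()

Result: the same table as in Method1.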

Answer 3

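I'm still new to Spark SQL, and this may not be the most efficient way to handle this scenario, but I'd be glad if it helps or can be improved further. This is how I managed to do it in Java.

Step1: Create a SparkSession and load the file into a DataFrame. Code: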

public void process() throws AnalysisException {
SparkSession session = new SparkSession.Builder()
.appName("Untyped Aggregation on data frame")
.master("local")
.getOrCreate();

//Load the file that you need to compute. 

Dataset<Row> peopledf = session.read()
.option("header","true")
.option("delimiter"," ")
.csv("src/main/resources/Person.txt");

Output:

+--------+---+--------+
|    name|age|property|
+--------+---+--------+
|  Gaurav| 27|       1|
| Dheeraj| 30|       1|
|  Saloni| 26|       1|
|  Deepak| 30|       1|
|      Db| 25|       1|
|Praneeth| 24|       1|
|   jyoti| 26|       1|
+--------+---+--------+

Step2 (optional):

If you need to set a constant value for any single column.

Code:

//In case you need to change the value of a single column.
Dataset<Row> peopledf1 = peopledf.withColumn("property",lit("0"));
peopledf1.show();

Output:
+--------+---+--------+
|    name|age|property|
+--------+---+--------+
|  Gaurav| 27|       0|
| Dheeraj| 30|       0|
|  Saloni| 26|       0|
|  Deepak| 30|       0|
|      Db| 25|       0|
|Praneeth| 24|       0|
|   jyoti| 26|       0|
+--------+---+--------+

Step3:

Get a String array of all the column names in the DataFrame. Code:

//Get the list of all the columns
String[] myStringArray = peopledf1.columns();

Step4: Logic that filters out of the array the columns you do not want to set to a constant, and builds the lists of column names and lit("constant") values that withColumns needs.

Code:

        //Create two lists: one with the names of the columns to overwrite,
        //the other of the same size (one element per listed column) holding
        //lit("0"), i.e. the constant.
        //Skip the columns you do not want to overwrite with the constant.
        List<String> myList = new ArrayList<String>();
        List<Column> myList1 = new ArrayList<Column>();
        for(String element : myStringArray){
            if(!(element.contains("name"))){
                myList.add(element);
                myList1.add(lit("0"));
            }
        }

Step5: Convert the lists to Scala Seq, because the withColumns method expects its arguments in that form. Code:

    //Convert both lists into Scala Seq<Column> and Seq<String> respectively.
    //This is needed because the withColumns method requires its arguments as Seq.
    //See the Scala docs for withColumns.
    Seq<Column> mySeq1 = convertListToSeq(myList1);
    Seq<String> mySeq= convertListToSeq1(myList);

Code for convertListToSeq, using JavaConverters:

    //Use JavaConverters to convert a List to a Scala Seq with the methods below
    public Seq<String> convertListToSeq1(List<String> inputList) {
        return JavaConverters.asScalaIteratorConverter(inputList.iterator())
            .asScala().toSeq();
    }

    public Seq<Column> convertListToSeq(List<Column> inputList) {
        return JavaConverters.asScalaIteratorConverter(inputList.iterator())
            .asScala().toSeq();
    }

Step6: Print the output to the console. Code:

    //Display the required output on console.
    peopledf1.withColumns(mySeq,mySeq1).show();

Output:

+--------+---+--------+
|    name|age|property|
+--------+---+--------+
|  Gaurav|  0|       0|
| Dheeraj|  0|       0|
|  Saloni|  0|       0|
|  Deepak|  0|       0|
|      Db|  0|       0|
|Praneeth|  0|       0|
|   jyoti|  0|       0|
+--------+---+--------+
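For readers on PySpark, roughly the same idea as the steps above (collect the column names, skip the ones to leave alone, overwrite the rest with one constant) could be sketched as follows. This assumes Spark 3.3 or later, where DataFrame.withColumns accepts a dict of column name to Column. The names columns_to_overwrite and overwrite_with_constant are hypothetical helpers, not Spark APIs.

```python
def columns_to_overwrite(columns, skip_substring):
    """Return the column names that do not contain skip_substring, in order."""
    return [c for c in columns if skip_substring not in c]


def overwrite_with_constant(df, names, value):
    """Overwrite the named columns with one constant; other columns are left untouched."""
    # Lazy import so the filtering helper can run without Spark installed.
    from pyspark.sql.functions import lit
    # DataFrame.withColumns takes a dict of column name -> Column (assumes Spark 3.3+).
    return df.withColumns({name: lit(value) for name in names})


# Usage with a Spark DataFrame named peopledf (assumed to exist):
# peopledf = overwrite_with_constant(peopledf, names, "0")
names = columns_to_overwrite(["name", "age", "property"], "name")
print(names)  # ['age', 'property']
```

Unlike the select-based approach, withColumns keeps the skipped columns (here name) with their original values and positions.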

If the code can be improved further, please leave a comment.

Happy learning, Gaurav

How can I change multiple column values to a constant without specifying all the column names?
