Merging multiple columns into one column in a PySpark dataframe using Python

Asked: 2017-06-19 09:35:55

Tags: python dataframe pyspark

I need to merge multiple columns of a dataframe into a single column, with a list (or tuple) as the column's value, using pyspark in Python.

Input dataframe:

+-------+-------+-------+-------+-------+
| name  |mark1  |mark2  |mark3  | Grade |
+-------+-------+-------+-------+-------+
| Jim   | 20    | 30    | 40    |  "C"  |
+-------+-------+-------+-------+-------+
| Bill  | 30    | 35    | 45    |  "A"  |
+-------+-------+-------+-------+-------+
| Kim   | 25    | 36    | 42    |  "B"  |
+-------+-------+-------+-------+-------+

Output dataframe should be

+-------+-----------------+
| name  |marks            |
+-------+-----------------+
| Jim   | [20,30,40,"C"]  |
+-------+-----------------+
| Bill  | [30,35,45,"A"]  |
+-------+-----------------+
| Kim   | [25,36,42,"B"]  |
+-------+-----------------+

4 Answers:

Answer 0 (score: 1)

See this documentation: https://spark.apache.org/docs/2.1.0/ml-features.html#vectorassembler

from pyspark.ml.feature import VectorAssembler

# VectorAssembler merges the given input columns into a single vector column.
# It only accepts numeric inputs, so the string "Grade" column cannot be
# included directly (see the StringIndexer answer below for a workaround).
assembler = VectorAssembler(
    inputCols=["mark1", "mark2", "mark3"],
    outputCol="marks")

output = assembler.transform(dataset)
output.select("name", "marks").show(truncate=False)
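
With the input dataframe from the question bound to dataset, the output would look roughly like this (the marks are assembled into a vector of doubles):

+----+----------------+
|name|marks           |
+----+----------------+
|Jim |[20.0,30.0,40.0]|
|Bill|[30.0,35.0,45.0]|
|Kim |[25.0,36.0,42.0]|
+----+----------------+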

Answer 1 (score: 0)

If this is still relevant, you can use StringIndexer to encode the string values as floating-point replacements.
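
A minimal sketch of that idea, combined with the VectorAssembler approach above (assuming the question's dataframe is named df):

from pyspark.ml.feature import StringIndexer, VectorAssembler

# Encode the string Grade column into a numeric index column
indexer = StringIndexer(inputCol="Grade", outputCol="GradeIndex")
indexed = indexer.fit(df).transform(df)

# All four inputs are now numeric and can be assembled into one vector column
assembler = VectorAssembler(
    inputCols=["mark1", "mark2", "mark3", "GradeIndex"],
    outputCol="marks")
assembler.transform(indexed).select("name", "marks").show(truncate=False)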

Answer 2 (score: 0)

Columns can be merged with Spark's array function. A minimal sketch of that approach (array requires all elements to share one type, hence the casts to string):

from pyspark.sql import functions as F

# array() needs a single common element type, so cast the numeric marks to string
df.select(
    "name",
    F.array(
        F.col("mark1").cast("string"),
        F.col("mark2").cast("string"),
        F.col("mark3").cast("string"),
        F.col("Grade"),
    ).alias("marks"),
).show(truncate=False)

You may have to change the type of the entries (as with the casts above) in order for the merge to be successful.
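
Alternatively, if the original column types should be preserved, a struct column (Spark's tuple-like type, which matches the "tuple" option in the question) works without any casts; a minimal sketch:

from pyspark.sql import functions as F

# struct keeps each field's original type instead of forcing a common one
df.select(
    "name",
    F.struct("mark1", "mark2", "mark3", "Grade").alias("marks"),
).show(truncate=False)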

Answer 3 (score: 0)

You can do a select like the following:

from pyspark.sql.functions import col, concat, lit

df.select(
    "name",
    concat(
        col("mark1"), lit(","),
        col("mark2"), lit(","),
        col("mark3"), lit(","),
        col("Grade"),
    ).alias("marks"),
)

If you need the [ and ], you can add them with the lit function:

from pyspark.sql.functions import col, concat, lit

df.select(
    "name",
    concat(
        lit("["),
        col("mark1"), lit(","),
        col("mark2"), lit(","),
        col("mark3"), lit(","),
        col("Grade"), lit("]"),
    ).alias("marks"),
)
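
Note that concat produces a single string column rather than an actual array (concat implicitly casts the numeric marks to string); with the question's data the result would look roughly like:

+----+------------+
|name|marks       |
+----+------------+
|Jim |[20,30,40,C]|
|Bill|[30,35,45,A]|
|Kim |[25,36,42,B]|
+----+------------+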