I need to merge multiple columns of a dataframe into a single column, using a list (or tuple) as the column's value, with PySpark in Python.
Input dataframe:
+-------+-------+-------+-------+-------+
| name |mark1 |mark2 |mark3 | Grade |
+-------+-------+-------+-------+-------+
| Jim | 20 | 30 | 40 | "C" |
+-------+-------+-------+-------+-------+
| Bill | 30 | 35 | 45 | "A" |
+-------+-------+-------+-------+-------+
| Kim | 25 | 36 | 42 | "B" |
+-------+-------+-------+-------+-------+
Output dataframe should be
+-------+-----------------+
| name |marks |
+-------+-----------------+
| Jim | [20,30,40,"C"] |
+-------+-----------------+
| Bill | [30,35,45,"A"] |
+-------+-----------------+
| Kim | [25,36,42,"B"] |
+-------+-----------------+
Answer 0 (score: 1)
See this documentation: https://spark.apache.org/docs/2.1.0/ml-features.html#vectorassembler
from pyspark.ml.feature import VectorAssembler

# Note: VectorAssembler only accepts numeric input columns,
# so the string Grade column is left out here
assembler = VectorAssembler(
    inputCols=["mark1", "mark2", "mark3"],
    outputCol="marks")
output = assembler.transform(dataset)
output.select("name", "marks").show(truncate=False)
Answer 1 (score: 0)
If this is still relevant: you can use StringIndexer to encode the string values as float replacements.
Answer 2 (score: 0)
Columns can be merged with Spark's array function.
You may need to change the type of the entries for the merge to succeed.
Answer 3 (score: 0)
You can select something like the following:
from pyspark.sql.functions import concat, col, lit

df.select('name',
          concat(
              col("mark1"), lit(","),
              col("mark2"), lit(","),
              col("mark3"), lit(","),
              col("Grade")
          ).alias('marks')
)
If you need the square brackets [ ], you can add them with the lit function:
from pyspark.sql.functions import concat, col, lit

df.select('name',
          concat(lit("["),
                 col("mark1"), lit(","),
                 col("mark2"), lit(","),
                 col("mark3"), lit(","),
                 col("Grade"), lit("]")
          ).alias('marks')
)