I have 30 columns. 26 of the column names are the letters of the alphabet (A through Z). I want to combine those 26 columns into a single column, as one string.
price  dateCreate  volume  country  A  B  C  D  E   ...  Z
19     20190501    25      US       1  2  5  6  19  ...  30
49     20190502    30      US       5  4  5  0  34  ...  50
I want this:
price  dateCreate  volume  country  new_col
19     20190501    25      US       "1,2,5,6,19,...,30"
49     20190502    30      US       "5,4,5,0,34,50"
I know I can do something like this:
df.withColumn("new_col", concat($"A", $"B", ...$"Z"))
But the next time I run into this problem, I'd like to know an easier way to concatenate many columns. Is there one?
Answer 0 (Score: 3)
Just apply the following to however many columns you want to concatenate:
import org.apache.spark.sql.functions.{array, col, concat_ws}
import spark.implicits._ // for toDF and $"..." outside the spark-shell

val df = Seq((19, 20190501, 24, "US", 1, 2, 5, 6, 19),
             (49, 20190502, 30, "US", 5, 4, 5, 0, 34))
  .toDF("price", "dataCreate", "volume", "country", "A", "B", "C", "D", "E")

val exprs = df.columns.drop(4).map(col _) // every column after the first four

df.select($"price", $"dataCreate", $"volume", $"country",
  concat_ws(",", array(exprs: _*)).as("new_col")).show()
+-----+----------+------+-------+----------+
|price|dataCreate|volume|country| new_col|
+-----+----------+------+-------+----------+
| 19| 20190501| 24| US|1,2,5,6,19|
| 49| 20190502| 30| US|5,4,5,0,34|
+-----+----------+------+-------+----------+
For completeness, here is the PySpark equivalent:
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [[19, 20190501, 24, "US", 1, 2, 5, 6, 19],
     [49, 20190502, 30, "US", 5, 4, 5, 0, 34]],
    ["price", "dataCreate", "volume", "country", "A", "B", "C", "D", "E"])

exprs = df.columns[4:]  # every column after the first four
df.select("price", "dataCreate", "volume", "country",
          F.concat_ws(",", F.array(*exprs)).alias("new_col")).show()
Answer 1 (Score: 2)
Perhaps you had something like the following in mind:
Scala
import org.apache.spark.sql.functions.{col, concat_ws}

val cols = ('A' to 'Z').map(c => col(c.toString)) // col() expects a String, not a Char
df.withColumn("new_col", concat_ws(",", cols: _*))
Python
from pyspark.sql.functions import col, concat_ws
import string

cols = [col(x) for x in string.ascii_uppercase]  # columns "A" through "Z"
df.withColumn("new_col", concat_ws(",", *cols))
Answer 2 (Score: 1)
Starting with Spark 2.3.0, you can do this in Spark SQL itself using the concatenation operator:
spark.sql("select A||B||C from table");