I have a column of lists in a Spark dataframe.
+-----------------+
|features         |
+-----------------+
|[0,45,63,0,0,0,0]|
|[0,0,0,85,0,69,0]|
|[0,89,56,0,0,0,0]|
+-----------------+
How do I convert this into a Spark dataframe in which each element of the list becomes its own column? We can assume the lists are all the same size.
For example,
+--------------------+
|c1|c2|c3|c4|c5|c6|c7|
+--------------------+
|0 |45|63|0 |0 |0 |0 |
|0 |0 |0 |85|0 |69|0 |
|0 |89|56|0 |0 |0 |0 |
+--------------------+
Answer 0 (score: 4)
What you are describing is, in effect, the inverse of the VectorAssembler operation.
You can do it by converting to an intermediate RDD, as follows:
spark.version
# u'2.2.0'
# your data:
df.show(truncate=False)
# +-----------------+
# |features         |
# +-----------------+
# |[0,45,63,0,0,0,0]|
# |[0,0,0,85,0,69,0]|
# |[0,89,56,0,0,0,0]|
# +-----------------+
dimensionality = 7
out = df.rdd.map(lambda x: [float(x[0][i]) for i in range(dimensionality)]) \
        .toDF(schema=['c' + str(i + 1) for i in range(dimensionality)])
out.show()
# +---+----+----+----+---+----+---+
# | c1| c2| c3| c4| c5| c6| c7|
# +---+----+----+----+---+----+---+
# |0.0|45.0|63.0| 0.0|0.0| 0.0|0.0|
# |0.0| 0.0| 0.0|85.0|0.0|69.0|0.0|
# |0.0|89.0|56.0| 0.0|0.0| 0.0|0.0|
# +---+----+----+----+---+----+---+
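If the features column is an ML Vector and you are on Spark 3.0+, the same inversion can be sketched without the RDD round-trip, using pyspark.ml.functions.vector_to_array (assuming, as above, a fixed dimensionality of 7):
from pyspark.ml.functions import vector_to_array

# Convert the Vector column to a plain array column, then index it natively
arr_df = df.withColumn('arr', vector_to_array('features'))
arr_df.select(*[arr_df['arr'][i].alias('c' + str(i + 1)) for i in range(7)]).show()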
Answer 1 (score: 3)
You can use getItem:
df.withColumn("c1", df["features"].getItem(0))\
.withColumn("c2", df["features"].getItem(1))\
.withColumn("c3", df["features"].getItem(2))\
.withColumn("c4", df["features"].getItem(3))\
.withColumn("c5", df["features"].getItem(4))\
.withColumn("c6", df["features"].getItem(5))\
.withColumn("c7", df["features"].getItem(6))\
.drop('features').show()
+--------------------+
|c1|c2|c3|c4|c5|c6|c7|
+--------------------+
|0 |45|63|0 |0 |0 |0 |
|0 |0 |0 |85|0 |69|0 |
|0 |89|56|0 |0 |0 |0 |
+--------------------+
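If the seven chained withColumn calls feel repetitive, a single select built from a list comprehension produces the same result (a sketch, assuming the same fixed list size of 7):
df.select(*[df['features'].getItem(i).alias('c' + str(i + 1)) for i in range(7)]).show()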
Answer 2 (score: 3)
Here is an alternative that does not require converting to an RDD:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType

## The udf is needed when features is a VectorAssembler output (an ML Vector)
udf1 = F.udf(lambda x: x.toArray().tolist(), ArrayType(FloatType()))
df = df.withColumn('features1', udf1('features'))

## Useful if the lists have varying sizes; here every list has length 7
stop = df.select(F.max(F.size('features1')).alias('size')).first().size

df.select(*[df.features1[i].alias('col_{}'.format(i + 1)) for i in range(stop)]).show()
+-----+-----+-----+-----+-----+-----+-----+
|col_1|col_2|col_3|col_4|col_5|col_6|col_7|
+-----+-----+-----+-----+-----+-----+-----+
|  0.0| 45.0| 63.0|  0.0|  0.0|  0.0|  0.0|
|  0.0|  0.0|  0.0| 85.0|  0.0| 69.0|  0.0|
|  0.0| 89.0| 56.0|  0.0|  0.0|  0.0|  0.0|
+-----+-----+-----+-----+-----+-----+-----+
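The udf conversion step is only needed because an ML Vector column cannot be indexed with features[i] directly; if features were already a plain ArrayType column, the select could run on it as-is.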
Answer 3 (score: 2)
@desertnaut's answer can also be done with the dataframe API and a udf.
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType

dimensionality = 7
column_names = ['c' + str(i + 1) for i in range(dimensionality)]
# i=i binds the loop index at definition time; without it every udf would
# read the final value of i and return the last element
splits = [F.udf(lambda val, i=i: float(val[i]), FloatType()) for i in range(dimensionality)]
df = df.select(*[s('features').alias(j) for s, j in zip(splits, column_names)])
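Note that each of these udfs incurs a Python round-trip per column; on a plain array column, the native getItem approach in answer 1 avoids that overhead entirely.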