Modifying Spark DataFrame columns

Date: 2016-11-17 10:28:38

Tags: apache-spark spark-dataframe

I have a Spark DataFrame and I want to add a new column with a specific value. I tried using the withColumn function, but it did not work as expected. I want either a new column holding a specific value, or to replace an existing column.

1 Answer:

Answer 0: (score: 0)

See this example.

I have a DataFrame:

>>> df.show()
+-------+----+-----+---+
|   name|year|month|day|
+-------+----+-----+---+
|    Ali|2014|    9|  1|
|  Matei|2015|   10| 26|
|Michael|2015|   10| 25|
|Reynold|2015|   10| 25|
|Patrick|2015|    9|  1|
+-------+----+-----+---+
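
For reference, here is a minimal sketch of how a DataFrame like this could be built, assuming a SparkSession named spark; the rows and column names are taken from the output above:

>>> df = spark.createDataFrame(
...     [('Ali', 2014, 9, 1), ('Matei', 2015, 10, 26),
...      ('Michael', 2015, 10, 25), ('Reynold', 2015, 10, 25),
...      ('Patrick', 2015, 9, 1)],
...     ['name', 'year', 'month', 'day'])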

I want to add the same piece of information to every row; I can do this with lit:

>>> from pyspark.sql.functions import lit
>>> df.withColumn('my_new_column', lit('testing info for all')).show()
+-------+----+-----+---+--------------------+
|   name|year|month|day|       my_new_column|
+-------+----+-----+---+--------------------+
|    Ali|2014|    9|  1|testing info for all|
|  Matei|2015|   10| 26|testing info for all|
|Michael|2015|   10| 25|testing info for all|
|Reynold|2015|   10| 25|testing info for all|
|Patrick|2015|    9|  1|testing info for all|
+-------+----+-----+---+--------------------+
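
Note that withColumn returns a new DataFrame rather than modifying df in place, which is a common reason it can seem not to work as expected. A minimal sketch, keeping the result in a new variable (df2 is just an assumed name):

>>> # df itself is unchanged; the DataFrame with the extra column is df2
>>> df2 = df.withColumn('my_new_column', lit('testing info for all'))
>>> df2.columns
['name', 'year', 'month', 'day', 'my_new_column']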

If you want to add a column containing a list of different values for each row, you can use explode:

>>> from pyspark.sql.functions import explode, array
>>> df.withColumn('my_new_column', 
...               explode(array(lit('testing info for all'), 
...                             lit('other testing again')))).show()
+-------+----+-----+---+--------------------+
|   name|year|month|day|       my_new_column|
+-------+----+-----+---+--------------------+
|    Ali|2014|    9|  1|testing info for all|
|    Ali|2014|    9|  1| other testing again|
|  Matei|2015|   10| 26|testing info for all|
|  Matei|2015|   10| 26| other testing again|
|Michael|2015|   10| 25|testing info for all|
|Michael|2015|   10| 25| other testing again|
|Reynold|2015|   10| 25|testing info for all|
|Reynold|2015|   10| 25| other testing again|
|Patrick|2015|    9|  1|testing info for all|
|Patrick|2015|    9|  1| other testing again|
+-------+----+-----+---+--------------------+
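
To replace an existing column instead, pass its current name to withColumn and the old values are overwritten. A minimal sketch, assuming you want to overwrite year with a constant (2016 is just an illustrative value):

>>> # using the existing column name 'year' replaces that column
>>> df.withColumn('year', lit(2016)).show()
+-------+----+-----+---+
|   name|year|month|day|
+-------+----+-----+---+
|    Ali|2016|    9|  1|
|  Matei|2016|   10| 26|
|Michael|2016|   10| 25|
|Reynold|2016|   10| 25|
|Patrick|2016|    9|  1|
+-------+----+-----+---+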