How to concatenate 2 columns of ArrayType along axis=1 in a PySpark dataframe?

Date: 2019-12-09 18:57:06

Tags: python pyspark concatenation

I have the following dataframe:

I want to combine lat and lon into a single list. mmsi is similar to an ID (it is unique).


So I want to concatenate the lat and lon arrays, but along axis=1, that is, I want to end up with a list of lists in a single separate column, for example:

+---------+--------------------+--------------------+
|     mmsi|                 lat|                 lon|
+---------+--------------------+--------------------+
|255801480|[47.1018366666666...|[-5.3017783333333...|
|304182000|[44.6343033333333...|[-63.564803333333...|
|304682000|[41.1936, 41.1715...|[-8.7716, -8.7514...|
|305930000|[49.5221333333333...|[-3.6310166666666...|
|306216000|[42.8185133333333...|[-29.853155, -29....|
|477514400|[47.17205, 47.165...|[-58.6317, -58.60...|

How can this be done in a PySpark dataframe? I have already tried concat, but it returns:

[[47.1018366666666, -5.3017783333333], ... ]

Any help is greatly appreciated!

1 Answer:

Answer 0 (score: 1):

Since Spark version 2.4, you can use the built-in function arrays_zip:

from pyspark.sql.functions import arrays_zip

# Pairs the i-th element of lat with the i-th element of lon in each row
df.withColumn('zipped_lat_lon', arrays_zip(df.lat, df.lon)).show()
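To see the element-wise pairing that arrays_zip performs within each row, here is a minimal pure-Python sketch of the same operation using the built-in zip (the lat/lon values below are hypothetical, taken to resemble one row of the dataframe above):

```python
# Hypothetical per-row values of the lat and lon array columns
lat = [41.1936, 41.1715]
lon = [-8.7716, -8.7514]

# arrays_zip pairs elements by position, like Python's zip:
zipped = [list(pair) for pair in zip(lat, lon)]
print(zipped)  # [[41.1936, -8.7716], [41.1715, -8.7514]]
```

Unlike concat, which would join the two arrays end-to-end into one flat array, this produces one (lat, lon) pair per position, which matches the axis=1 behavior the question asks for.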