Extracting a subarray from a PySpark DataFrame column

Time: 2018-12-17 08:52:15

Tags: python arrays list pyspark

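I want to remove the last element of the array in each row of this DataFrame. There is this link that shows the same thing, but it relies on UDFs, which I would like to avoid. Is there a simple way to do this, something like list[:2]?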

data = [(['cat','dog','sheep'],),(['bus','truck','car'],),(['ice','pizza','pasta'],)]
df = sqlContext.createDataFrame(data,['data'])
df.show()
+-------------------+
|               data|
+-------------------+
|  [cat, dog, sheep]|
|  [bus, truck, car]|
|[ice, pizza, pasta]|
+-------------------+

Expected DataFrame:

+--------------+
|          data|
+--------------+
|    [cat, dog]|
|  [bus, truck]|
|  [ice, pizza]|
+--------------+

1 answer:

Answer 0 (score: 0)

UDFs are the best thing about PySpark :)

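from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Keep the first two elements of each array. The return type must be declared;
# without it, udf() defaults to StringType and the result is not an array.
split_row = udf(lambda row: row[:2], ArrayType(StringType()))

# Apply the UDF to each row
new_df = df.withColumn("data", split_row(df["data"]))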

new_df.show()
# Output

+------------+
|        data|
+------------+
|  [cat, dog]|
|[bus, truck]|
|[ice, pizza]|
+------------+