I want to merge two different arrays of lists into one. Each array is a column in a Spark DataFrame, so I want to use a udf:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType

def some_function(u,v):
    li = list()
    for x,y in zip(u,v):
        li.append(x.extend(y))
    return li
udf_object = udf(some_function,ArrayType(ArrayType(StringType())))
new_x = x.withColumn('new_name',udf_object(col('name'),col('features')))
This is the data schema:
root
|-- blockingkey: string (nullable = true)
|-- blocked_records: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- flattened_array: array (nullable = true)
| |-- element: string (containsNull = true)
|-- features: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: float (containsNull = true)
|-- name: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
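I am trying to merge name and features, so that the first element of name is merged with the first element of features, and so on. But this returns an array of null values whenever Integer or Float values are present. If this can be solved with a udf or in some other sensible way, please help me.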
Answer 0 (score: 0)
If you have the dataframe and schema as
+------------------------------------------------+----------------------------------------+
|features |name |
+------------------------------------------------+----------------------------------------+
|[WrappedArray(2.0, 3.0), WrappedArray(3.0, 5.0)]|[WrappedArray(a, b), WrappedArray(c, d)]|
|[WrappedArray(2.0, 3.0), WrappedArray(3.0, 5.0)]|[WrappedArray(a, b), WrappedArray(c, d)]|
+------------------------------------------------+----------------------------------------+
root
|-- features: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: double (containsNull = true)
|-- name: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
then you can define a udf function and call it as
import pyspark.sql.types as t
from pyspark.sql import functions as f

def some_function(u,v):
    li = []
    # pair up the inner arrays of both columns by position
    for x, y in zip(u, v):
        # `+` returns a new concatenated list (unlike extend, which returns None)
        li.append(x + y)
    return li

udf_object = f.udf(some_function,t.ArrayType(t.ArrayType(t.StringType())))
new_x = x.withColumn('new_name',udf_object(f.col('name'),f.col('features')))
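The key change from the question's code is `x + y` in place of `x.extend(y)`. A small plain-Python sketch of the difference, using made-up sample values rather than the real columns:

```python
# Sample values for illustration only (not taken from the real DataFrame)
names = ["a", "b"]
features = [2.0, 3.0]

# list.extend mutates the list in place and returns None,
# so appending its result stores None
result_extend = list(names).extend(features)

# `+` builds and returns a new list
result_plus = names + features

print(result_extend)
print(result_plus)
```

Appending the return value of `extend` is therefore one likely source of the null entries described in the question.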
so new_x would be
+------------------------------------------------+----------------------------------------+------------------------------------------------------------+
|features |name |new_name |
+------------------------------------------------+----------------------------------------+------------------------------------------------------------+
|[WrappedArray(2.0, 3.0), WrappedArray(3.0, 5.0)]|[WrappedArray(a, b), WrappedArray(c, d)]|[WrappedArray(a, b, 2.0, 3.0), WrappedArray(c, d, 3.0, 5.0)]|
|[WrappedArray(2.0, 3.0), WrappedArray(3.0, 5.0)]|[WrappedArray(a, b), WrappedArray(c, d)]|[WrappedArray(a, b, 2.0, 3.0), WrappedArray(c, d, 3.0, 5.0)]|
+------------------------------------------------+----------------------------------------+------------------------------------------------------------+
root
|-- features: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: double (containsNull = true)
|-- name: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- new_name: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
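The declared return type is an array of string arrays, while features holds doubles, so the result depends on how Spark coerces those values. A variant that casts explicitly in Python might look like the sketch below; `merge_rows` is a hypothetical name, and the None handling is an assumption about how null arrays should be treated.

```python
def merge_rows(names, features):
    """Concatenate inner lists pairwise, casting every item to str."""
    # Assumption: a null outer array yields a null result
    if names is None or features is None:
        return None
    merged = []
    for name_row, feature_row in zip(names, features):
        # `or []` treats a null inner array as empty (an assumption)
        row = list(name_row or []) + list(feature_row or [])
        # keep None items as None, cast everything else to str
        merged.append([None if item is None else str(item) for item in row])
    return merged


sample = merge_rows([["a", "b"], ["c", "d"]], [[2.0, 3.0], [3.0, 5.0]])
print(sample)
```

It could then be registered the same way as above, for example `f.udf(merge_rows, t.ArrayType(t.ArrayType(t.StringType())))`.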
I hope the answer is helpful.