Replacing Nulls with an Array in PySpark

Asked: 2017-06-12 14:23:51

Tags: arrays null pyspark

After joining on ID, my dataframe looks like this:

ID  |  Features  |  Vector
1   | (50,[...]  | Array[1.1,2.3,...]
2   | (50,[...]  | Null

I have found Null values in the "Vector" column for some IDs. I would like to replace these Nulls with a 300-dimensional array of zeros (the same format as the non-null vector entries). df.fillna does not work here, since it is an array I want to insert. Any idea how to accomplish this in PySpark?

--- EDIT ---

My current approach, similar to this post, is:

import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

df_joined = id_feat_vec.join(new_vec_df, "id", how="left_outer")

# Fall back to a 300-dimensional zero vector when the join produced no match
fill_with_vector = udf(lambda x: x if x is not None else np.zeros(300),
                       ArrayType(DoubleType()))

df_new = df_joined.withColumn("vector", fill_with_vector("vector"))

Unfortunately, without much success:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 848.0 failed 4 times, most recent failure: Lost task 0.3 in stage 848.0 (TID 692199, 10.179.224.107, executor 16): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-193-e55fed27fcd8> in <module>()
      5 a = df_joined.withColumn("vector", fill_with_vector("vector"))
      6 
----> 7 a.show()

/databricks/spark/python/pyspark/sql/dataframe.pyc in show(self, n, truncate)
    316         """
    317         if isinstance(truncate, bool) and truncate:
--> 318             print(self._jdf.showString(n, 20))
    319         else:
    320             print(self._jdf.showString(n, int(truncate)))

/databricks/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:
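
For reference, this PickleException typically means the UDF returned a numpy object (here np.zeros(300)), which Spark cannot convert into an ArrayType(DoubleType()) column. A minimal sketch of a fix, reusing df_joined from above, is to return a plain Python list instead:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

# Return a plain Python list of floats rather than a numpy array,
# so Spark can serialize the result into an ArrayType(DoubleType()) column
fill_with_vector = udf(lambda x: x if x is not None else [0.0] * 300,
                       ArrayType(DoubleType()))

df_new = df_joined.withColumn("vector", fill_with_vector("vector"))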

2 Answers:

Answer 0 (score: 2)

Update: I could not get the SQL expression form to create an array of doubles; 'array(0.0, ...)' seems to create an array of Decimal type. Using the python functions, however, correctly creates an array of doubles.

The general idea is to use the when/otherwise functions to selectively update only the desired rows. You can define the desired literal value as a column ahead of time, then drop it into the "then" clause.

from pyspark.sql.types import *
from pyspark.sql.functions import *

schema = StructType([StructField("f1", LongType()), StructField("f2", ArrayType(DoubleType(), False))])
data = [(1, [10.0, 11.0]), (2, None), (3, None)]

df = sqlContext.createDataFrame(sc.parallelize(data), schema)

# Create a column object storing the value you want in the NULL case
num_elements = 300
null_value = array([lit(0.0)] * num_elements)

# If you want a different type you can change it like this
# null_value = null_value.cast('array<float>')

# Keep the value when there is one, replace it when it's null
df2 = df.withColumn('f2', when(df['f2'].isNull(), null_value).otherwise(df['f2']))
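
If you prefer a single expression, coalesce should be equivalent here; a sketch reusing the null_value column defined above:

from pyspark.sql.functions import coalesce

# coalesce picks the first non-null value, so null entries in f2
# fall through to the zero-filled literal array
df3 = df.withColumn('f2', coalesce(df['f2'], null_value))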

Answer 1 (score: 1)

You could try issuing an update request on your dataset using where, replacing every NULL in the Vector column with an array. Are you using SparkSQL and dataframes?
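
Since DataFrames are immutable there is no in-place UPDATE; the closest equivalent is rebuilding the column with a conditional, much as in the answer above. A minimal sketch, assuming 300-dimensional vectors and the column names from the question:

from pyspark.sql.functions import when, col, array, lit

# Build a literal zero vector once, then substitute it wherever Vector is NULL
zero_vec = array(*[lit(0.0)] * 300)
df_fixed = df.withColumn("Vector",
                         when(col("Vector").isNull(), zero_vec)
                         .otherwise(col("Vector")))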