In Spark, I have the following DataFrame, named "df", which contains some null entries:
+-------+--------------------+--------------------+
| id| features1| features2|
+-------+--------------------+--------------------+
| 185|(5,[0,1,4],[0.1,0...| null|
| 220|(5,[0,2,3],[0.1,0...|(10,[1,2,6],[0.1,...|
| 225| null|(10,[1,3,5],[0.1,...|
+-------+--------------------+--------------------+
df.features1 and df.features2 are of type vector (nullable). I then tried to fill the null entries with SparseVectors using the following code:
df1 = df.na.fill({"features1":SparseVector(5,{}), "features2":SparseVector(10, {})})
This code raised the following error:
AttributeError: 'SparseVector' object has no attribute '_get_object_id'
I then found the following paragraph in the Spark documentation:
fillna(value, subset=None)
Replace null values, alias for na.fill(). DataFrame.fillna() and DataFrameNaFunctions.fill() are aliases of each other.
Parameters:
value – int, long, float, string, or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, or string.
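Does this explain why I could not replace the null entries in the DataFrame with SparseVectors? Or does it mean there is no way to do this in a DataFrame at all?
I can achieve my goal by converting the DataFrame to an RDD and replacing the None values with SparseVectors, but doing this directly in the DataFrame would be much more convenient.
Is there a way to do this directly in the DataFrame? Thanks!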
Answer 0 (score: 3)
You can use a udf:
from pyspark.sql.functions import udf, lit
from pyspark.ml.linalg import SparseVector, VectorUDT

# Return the vector unchanged, or an empty SparseVector of size i when it is null.
fill_with_vector = udf(
    lambda x, i: x if x is not None else SparseVector(i, {}),
    VectorUDT()
)

# sc is the SparkContext, as provided by the pyspark shell.
df = sc.parallelize([
    (SparseVector(5, {1: 1.0}), SparseVector(10, {1: -1.0})), (None, None)
]).toDF(["features1", "features2"])

(df
    .withColumn("features1", fill_with_vector("features1", lit(5)))
    .withColumn("features2", fill_with_vector("features2", lit(10)))
    .show())
# +-------------+---------------+
# | features1| features2|
# +-------------+---------------+
# |(5,[1],[1.0])|(10,[1],[-1.0])|
# | (5,[],[])| (10,[],[])|
# +-------------+---------------+