I'm finding it hard to replace every instance of the string "None" in a Spark DataFrame with an actual null value.
The task I've been assigned requires me to replace "None" with a Spark null.
When I try:
data_sdf = data_sdf.na.fill("None", Seq("blank"))
it fails. Any suggestions on how to handle this?
Here is a sample of the Spark DataFrame I need to work with -
+--------------------+---------+---------+---------+---------+---------+---------+---------+
| business_id| monday| tuesday|wednesday| thursday| friday| saturday| sunday|
+--------------------+---------+---------+---------+---------+---------+---------+---------+
|FYWN1wneV18bWNgQj...|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0| None| None|
|He-G7vWjzVUysIKrf...| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-16:0| 8:0-16:0| None|
|KQPW8lFf1y5BT2Mxi...| None| None| None| None| None| None| None|
Answer 0 (score: 0)
I don't know whether there is a direct API like fillna for this, but we can implement it ourselves:
from pyspark.sql import Row

def replace_none_with_null(r):
    # Rebuild the Row, swapping the string "None" for an actual null.
    # Note: use .items() here -- .iteritems() is Python 2 only.
    return Row(**{k: None if v == "None" else v for k, v in r.asDict().items()})

# data_sdf is your dataframe
new_df = data_sdf.rdd.map(lambda x: replace_none_with_null(x)).toDF()
new_df.show()
Answer 1 (score: 0)
I think the None values are being stored in the df as the string "None". You can easily replace them with real nulls, and then fill them with empty strings if needed:
>>> data = sc.parallelize([
... ('FYWN1wneV18bWNgQj','7:30-17:0','7:30-17:0','7:30-17:0','7:30-17:0','7:30-17:0','None','None'),
... ('He-G7vWjzVUysIKrf','9:0-20:0','9:0-20:0','9:0-20:0','9:0-20:0','9:0-16:0','8:0-16:0','None'),
... ('KQPW8lFf1y5BT2Mxi','None','None','None','None','None','None','None')
... ])
>>>
>>> cols = ['business_id','monday','tuesday','wednesday','thursday','friday','saturday','sunday']
>>>
>>> df = spark.createDataFrame(data, cols)
>>>
>>> df.show()
+-----------------+---------+---------+---------+---------+---------+--------+------+
| business_id| monday| tuesday|wednesday| thursday| friday|saturday|sunday|
+-----------------+---------+---------+---------+---------+---------+--------+------+
|FYWN1wneV18bWNgQj|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0| None| None|
|He-G7vWjzVUysIKrf| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-16:0|8:0-16:0| None|
|KQPW8lFf1y5BT2Mxi| None| None| None| None| None| None| None|
+-----------------+---------+---------+---------+---------+---------+--------+------+
>>> df.replace('None',None).show()
+-----------------+---------+---------+---------+---------+---------+--------+------+
| business_id| monday| tuesday|wednesday| thursday| friday|saturday|sunday|
+-----------------+---------+---------+---------+---------+---------+--------+------+
|FYWN1wneV18bWNgQj|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0| null| null|
|He-G7vWjzVUysIKrf| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-16:0|8:0-16:0| null|
|KQPW8lFf1y5BT2Mxi| null| null| null| null| null| null| null|
+-----------------+---------+---------+---------+---------+---------+--------+------+
>>> df.replace('None',None).na.fill('').show()
+-----------------+---------+---------+---------+---------+---------+--------+------+
| business_id| monday| tuesday|wednesday| thursday| friday|saturday|sunday|
+-----------------+---------+---------+---------+---------+---------+--------+------+
|FYWN1wneV18bWNgQj|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0| | |
|He-G7vWjzVUysIKrf| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-16:0|8:0-16:0| |
|KQPW8lFf1y5BT2Mxi| | | | | | | |
+-----------------+---------+---------+---------+---------+---------+--------+------+