Replacing "None" with null in a Spark DataFrame in a Jupyter Notebook

Date: 2018-09-29 17:00:25

Tags: pyspark

I am finding it difficult to replace every instance of the string "None" in a Spark DataFrame with actual null values.

The task I have been assigned requires me to replace "None" with a Spark null.

When I try:

data_sdf = data_sdf.na.fill("None", Seq("blank"))

it fails. Any suggestions on how to approach this?

Here is a sample of the Spark DataFrame I need to work with:

+--------------------+---------+---------+---------+---------+---------+---------+---------+
|         business_id|   monday|  tuesday|wednesday| thursday|   friday| saturday|   sunday|
+--------------------+---------+---------+---------+---------+---------+---------+---------+
|FYWN1wneV18bWNgQj...|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0|     None|     None|
|He-G7vWjzVUysIKrf...| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-16:0| 8:0-16:0|     None|
|KQPW8lFf1y5BT2Mxi...|     None|     None|     None|     None|     None|     None|     None|
+--------------------+---------+---------+---------+---------+---------+---------+---------+

2 answers:

Answer 0 (score: 0):

I don't know whether there is a direct API like fillna for this, but we can implement it ourselves:

from pyspark.sql import Row

def replace_none_with_null(r):
    # Rebuild each Row, mapping the string "None" to an actual null (Python None)
    return Row(**{k: None if v == "None" else v for k, v in r.asDict().items()})

# data_sdf is your DataFrame
new_df = data_sdf.rdd.map(replace_none_with_null).toDF()
new_df.show()
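
For comparison, the same replacement can be done without the round trip through the RDD API, staying in the DataFrame API with when/otherwise (a minimal sketch, assuming every column is string-typed, as in the sample above):

from pyspark.sql import functions as F

# Sketch: in each column, turn the string "None" into a real null,
# leaving all other values untouched.
new_df = data_sdf.select([
    F.when(F.col(c) == "None", F.lit(None)).otherwise(F.col(c)).alias(c)
    for c in data_sdf.columns
])
new_df.show()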

Answer 1 (score: 0):

I think the "None" values are stored in your DataFrame as strings. You can easily replace them with nulls using replace, and then, if needed, fill the resulting nulls with empty strings:

>>> data = sc.parallelize([
...     ('FYWN1wneV18bWNgQj','7:30-17:0','7:30-17:0','7:30-17:0','7:30-17:0','7:30-17:0','None','None'),
...     ('He-G7vWjzVUysIKrf','9:0-20:0','9:0-20:0','9:0-20:0','9:0-20:0','9:0-16:0','8:0-16:0','None'),
...     ('KQPW8lFf1y5BT2Mxi','None','None','None','None','None','None','None')
...     ])
>>> 
>>> cols = ['business_id','monday','tuesday','wednesday','thursday','friday','saturday','sunday']
>>> 
>>> df = spark.createDataFrame(data, cols)
>>> 
>>> df.show()
+-----------------+---------+---------+---------+---------+---------+--------+------+
|      business_id|   monday|  tuesday|wednesday| thursday|   friday|saturday|sunday|
+-----------------+---------+---------+---------+---------+---------+--------+------+
|FYWN1wneV18bWNgQj|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0|    None|  None|
|He-G7vWjzVUysIKrf| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-16:0|8:0-16:0|  None|
|KQPW8lFf1y5BT2Mxi|     None|     None|     None|     None|     None|    None|  None|
+-----------------+---------+---------+---------+---------+---------+--------+------+

>>> df.replace('None',None).show()
+-----------------+---------+---------+---------+---------+---------+--------+------+
|      business_id|   monday|  tuesday|wednesday| thursday|   friday|saturday|sunday|
+-----------------+---------+---------+---------+---------+---------+--------+------+
|FYWN1wneV18bWNgQj|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0|    null|  null|
|He-G7vWjzVUysIKrf| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-16:0|8:0-16:0|  null|
|KQPW8lFf1y5BT2Mxi|     null|     null|     null|     null|     null|    null|  null|
+-----------------+---------+---------+---------+---------+---------+--------+------+

>>> df.replace('None',None).na.fill('').show()
+-----------------+---------+---------+---------+---------+---------+--------+------+
|      business_id|   monday|  tuesday|wednesday| thursday|   friday|saturday|sunday|
+-----------------+---------+---------+---------+---------+---------+--------+------+
|FYWN1wneV18bWNgQj|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0|7:30-17:0|        |      |
|He-G7vWjzVUysIKrf| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-20:0| 9:0-16:0|8:0-16:0|      |
|KQPW8lFf1y5BT2Mxi|         |         |         |         |         |        |      |
+-----------------+---------+---------+---------+---------+---------+--------+------+
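
As a usage note, both replace and na.fill accept an optional subset parameter, in case only the day-of-week columns should be touched and business_id must stay untouched (a sketch reusing the df built above):

>>> day_cols = ['monday','tuesday','wednesday','thursday','friday','saturday','sunday']
>>> df.replace('None', None, subset=day_cols).na.fill('', subset=day_cols).show()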