Replace NULL values in a Spark DataFrame with dict values keyed by another column

Date: 2019-11-18 16:59:51

Tags: python apache-spark pyspark

I have this dataframe and a key:value dictionary held in a broadcast variable. I want to replace the nulls in the dataframe's "value" column with the matching dict value, looked up through another dataframe column named "item", whose entries are the same as the dict's keys.

How can this be done?

# mapping
dict = {'temp': '70.0', 'speed': '98', 'wind': 'TRUE'}

# sample data
df = spark.createDataFrame([('2019-05-10 7:30:05', 'device1', 'event', 'temp', None),
                            ('2019-05-10 7:30:05', 'device2', 'sensor', 'speed', None),
                            ('2019-05-10 7:30:05', 'device3', 'monitor', 'wind', None),
                            ('2019-05-10 7:30:10', 'device1', 'event', 'temp', '75.2'),
                            ('2019-05-10 7:30:10', 'device2', 'sensor', 'speed', '100'),
                            ('2019-05-10 7:30:10', 'device3', 'monitor', 'wind', 'FALSE')],
                           ['date', 'name', 'type', 'item', 'value'])

# current input
+------------------+-------+-------+-----+-----+
|              date|   name|   type| item|value|
+------------------+-------+-------+-----+-----+
|2019-05-10 7:30:05|device1|  event| temp| null|
|2019-05-10 7:30:05|device2| sensor|speed| null|
|2019-05-10 7:30:05|device3|monitor| wind| null|
|2019-05-10 7:30:10|device1|  event| temp| 75.2|
|2019-05-10 7:30:10|device2| sensor|speed|  100|
|2019-05-10 7:30:10|device3|monitor| wind|FALSE|
+------------------+-------+-------+-----+-----+

# desired output
+------------------+-------+-------+-----+-----+
|              date|   name|   type| item|value|
+------------------+-------+-------+-----+-----+
|2019-05-10 7:30:05|device1|  event| temp| 70.0|
|2019-05-10 7:30:05|device2| sensor|speed|   98|
|2019-05-10 7:30:05|device3|monitor| wind| TRUE|
|2019-05-10 7:30:10|device1|  event| temp| 75.2|
|2019-05-10 7:30:10|device2| sensor|speed|  100|
|2019-05-10 7:30:10|device3|monitor| wind|FALSE|
+------------------+-------+-------+-----+-----+
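
For reference, since the dict is described as a broadcast variable, a minimal sketch of that setup could look like the following (the answers below simply use the plain dict directly):

# hypothetical broadcast wrapper; not required by any of the answers below
bc_map = spark.sparkContext.broadcast(dict)
bc_map.value['temp']  # '70.0' -- executors read the dict through .value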

3 Answers:

Answer 0 (score: 2)

Use coalesce together with create_map:

from pyspark.sql.functions import coalesce, lit, create_map, col
from itertools import chain 

map_dict = create_map(*[ lit(e) for e in chain.from_iterable(dict.items()) ])
# Column<b'map(temp, 70.0, speed, 98, wind, TRUE)'>

df.withColumn('value', coalesce('value', map_dict[col('item')])).show()
#+------------------+-------+-------+-----+-----+
#|              date|   name|   type| item|value|
#+------------------+-------+-------+-----+-----+
#|2019-05-10 7:30:05|device1|  event| temp| 70.0|
#|2019-05-10 7:30:05|device2| sensor|speed|   98|
#|2019-05-10 7:30:05|device3|monitor| wind| TRUE|
#|2019-05-10 7:30:10|device1|  event| temp| 75.2|
#|2019-05-10 7:30:10|device2| sensor|speed|  100|
#|2019-05-10 7:30:10|device3|monitor| wind|FALSE|
#+------------------+-------+-------+-----+-----+
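
To see what create_map is consuming here: chain.from_iterable flattens the dict's (key, value) pairs into one alternating key/value sequence, which create_map then pairs back up into map entries. A quick local check (plain Python, no Spark needed):

from itertools import chain

mapping = {'temp': '70.0', 'speed': '98', 'wind': 'TRUE'}
# flatten [('temp', '70.0'), ...] into one alternating list
list(chain.from_iterable(mapping.items()))
# ['temp', '70.0', 'speed', '98', 'wind', 'TRUE']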

For a very large dict mapping, you can instead build a dataframe from it and do a left join:

from pyspark.sql.functions import coalesce, broadcast

df_map = spark.createDataFrame(dict.items(), ['item', 'map_value'])

df.join(broadcast(df_map), on=['item'], how='left') \
  .withColumn('value', coalesce('value', 'map_value')) \
  .drop('map_value') \
  .show()
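
The broadcast() call here is a join hint: it tells Spark the mapping dataframe is small enough to ship whole to every executor, so the join can happen map-side without shuffling the large dataframe.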

Answer 1 (score: 1)

Use withColumn with a UDF:

from pyspark.sql.functions import col
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

dict = {'temp': '70.0', 'speed': '98', 'wind': 'TRUE'}
df = spark.createDataFrame([('2019-05-10 7:30:05', 'device1', 'event', 'temp', None),
                            ('2019-05-10 7:30:05', 'device2', 'sensor', 'speed', None),
                            ('2019-05-10 7:30:05', 'device3', 'monitor', 'wind', None),
                            ('2019-05-10 7:30:10', 'device1', 'event', 'temp', '75.2'),
                            ('2019-05-10 7:30:10', 'device2', 'sensor', 'speed', '100'),
                            ('2019-05-10 7:30:10', 'device3', 'monitor', 'wind', 'FALSE')],
                           ['date', 'name', 'type', 'item', 'value'])
def replace_null(item, value):
    # look the item up in the dict only when the value is null
    return dict[item] if value is None else value

replace_null_udf = udf(replace_null, StringType())

df2 = df.withColumn("tmp", replace_null_udf(col("item"), col("value")))
df2.show()
+------------------+-------+-------+-----+-----+
|              date|   name|   type| item|  tmp|
+------------------+-------+-------+-----+-----+
|2019-05-10 7:30:05|device1|  event| temp| 70.0|
|2019-05-10 7:30:05|device2| sensor|speed|   98|
|2019-05-10 7:30:05|device3|monitor| wind| TRUE|
|2019-05-10 7:30:10|device1|  event| temp| 75.2|
|2019-05-10 7:30:10|device2| sensor|speed|  100|
|2019-05-10 7:30:10|device3|monitor| wind|FALSE|
+------------------+-------+-------+-----+-----+
df3 = df2.drop("value").withColumnRenamed('tmp','value')
df3.show()
+------------------+-------+-------+-----+-----+
|              date|   name|   type| item|value|
+------------------+-------+-------+-----+-----+
|2019-05-10 7:30:05|device1|  event| temp| 70.0|
|2019-05-10 7:30:05|device2| sensor|speed|   98|
|2019-05-10 7:30:05|device3|monitor| wind| TRUE|
|2019-05-10 7:30:10|device1|  event| temp| 75.2|
|2019-05-10 7:30:10|device2| sensor|speed|  100|
|2019-05-10 7:30:10|device3|monitor| wind|FALSE|
+------------------+-------+-------+-----+-----+
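
As a side note, the intermediate tmp column is not strictly necessary: withColumn overwrites an existing column when given the same name, so a minimal variant of the call above would be:

# writing straight back to 'value' avoids the drop/rename step
df2 = df.withColumn("value", replace_null_udf(col("item"), col("value")))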

Answer 2 (score: 1)

You could consider the following solution:

from pyspark.sql import functions as F

mapping = {'temp': '70.0', 'speed': '98', 'wind': 'TRUE'}

mappingDf = sqlContext.createDataFrame(list(mapping.items()), ['item_t', 'value_t'])

# inner join by default: assumes every item has an entry in the mapping
df = df.join(mappingDf, df.item == mappingDf.item_t)
df = df.withColumn('value',
                   F.when(F.col('value').isNotNull(), df.value).otherwise(df.value_t)) \
       .drop('item_t').drop('value_t')
df.show()
+------------------+-------+-------+-----+-----+
|              date|   name|   type| item|value|
+------------------+-------+-------+-----+-----+
|2019-05-10 7:30:05|device1|  event| temp| 70.0|
|2019-05-10 7:30:10|device1|  event| temp| 75.2|
|2019-05-10 7:30:05|device3|monitor| wind| TRUE|
|2019-05-10 7:30:10|device3|monitor| wind|FALSE|
|2019-05-10 7:30:05|device2| sensor|speed|   98|
|2019-05-10 7:30:10|device2| sensor|speed|  100|
+------------------+-------+-------+-----+-----+
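
One caveat: the default inner join drops any row whose item has no entry in the mapping, and, as the output above shows, the original row order is not preserved. If unmapped items should be kept (leaving their value as null), a left join variant might look like this (a sketch, reusing the F alias imported above):

# 'left' keeps rows whose item is missing from the mapping
df = df.join(mappingDf, df.item == mappingDf.item_t, 'left')
df = df.withColumn('value', F.coalesce(df.value, df.value_t)).drop('item_t', 'value_t')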