I have this dataframe and a broadcast-variable dictionary of key:value pairs. I want to replace the null entries in the dataframe's "value" column with the dict values, looked up by another column in the dataframe named "item", whose contents match the dict keys. How can I do this?
# mapping (note: the name `dict` shadows the built-in type)
dict = {'temp': '70.0', 'speed': '98', 'wind': 'TRUE'}

# sample data
df = spark.createDataFrame([
    ('2019-05-10 7:30:05', 'device1', 'event', 'temp', None),
    ('2019-05-10 7:30:05', 'device2', 'sensor', 'speed', None),
    ('2019-05-10 7:30:05', 'device3', 'monitor', 'wind', None),
    ('2019-05-10 7:30:10', 'device1', 'event', 'temp', '75.2'),
    ('2019-05-10 7:30:10', 'device2', 'sensor', 'speed', '100'),
    ('2019-05-10 7:30:10', 'device3', 'monitor', 'wind', 'FALSE'),
], ['date', 'name', 'type', 'item', 'value'])
# current input
+------------------+-------+-------+-----+-----+
| date| name| type| item|value|
+------------------+-------+-------+-----+-----+
|2019-05-10 7:30:05|device1| event| temp| null|
|2019-05-10 7:30:05|device2| sensor|speed| null|
|2019-05-10 7:30:05|device3|monitor| wind| null|
|2019-05-10 7:30:10|device1| event| temp| 75.2|
|2019-05-10 7:30:10|device2| sensor|speed| 100|
|2019-05-10 7:30:10|device3|monitor| wind|FALSE|
+------------------+-------+-------+-----+-----+
# desired output
+------------------+-------+-------+-----+-----+
| date| name| type| item|value|
+------------------+-------+-------+-----+-----+
|2019-05-10 7:30:05|device1| event| temp| 70.0|
|2019-05-10 7:30:05|device2| sensor|speed| 98|
|2019-05-10 7:30:05|device3|monitor| wind| TRUE|
|2019-05-10 7:30:10|device1| event| temp| 75.2|
|2019-05-10 7:30:10|device2| sensor|speed| 100|
|2019-05-10 7:30:10|device3|monitor| wind|FALSE|
+------------------+-------+-------+-----+-----+
Answer 0 (score: 2)
from pyspark.sql.functions import coalesce, lit, create_map, col
from itertools import chain
map_dict = create_map(*[ lit(e) for e in chain.from_iterable(dict.items()) ])
# Column<b'map(temp, 70.0, speed, 98, wind, TRUE)'>
df.withColumn('value', coalesce('value', map_dict[col('item')])).show()
#+------------------+-------+-------+-----+-----+
#| date| name| type| item|value|
#+------------------+-------+-------+-----+-----+
#|2019-05-10 7:30:05|device1| event| temp| 70.0|
#|2019-05-10 7:30:05|device2| sensor|speed| 98|
#|2019-05-10 7:30:05|device3|monitor| wind| TRUE|
#|2019-05-10 7:30:10|device1| event| temp| 75.2|
#|2019-05-10 7:30:10|device2| sensor|speed| 100|
#|2019-05-10 7:30:10|device3|monitor| wind|FALSE|
#+------------------+-------+-------+-----+-----+
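The `create_map(*...)` call above relies on flattening the dict into alternating key/value literals before they become map entries; that flattening step alone can be checked in plain Python, without Spark:

```python
from itertools import chain

# dict.items() yields (key, value) pairs; chain.from_iterable
# interleaves them into [k1, v1, k2, v2, ...], which is exactly
# the argument shape create_map(*flat) expects.
mapping = {'temp': '70.0', 'speed': '98', 'wind': 'TRUE'}
flat = list(chain.from_iterable(mapping.items()))
# ['temp', '70.0', 'speed', '98', 'wind', 'TRUE']
```

Insertion order of the dict is preserved here (Python 3.7+), which is why the keys and values stay correctly paired.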
For a very large dict mapping, you can instead create a dataframe from it and do a left join:
from pyspark.sql.functions import coalesce, broadcast
df_map = spark.createDataFrame(dict.items(), ['item', 'map_value'])
df.join(broadcast(df_map), on=['item'], how='left') \
    .withColumn('value', coalesce('value', 'map_value')) \
    .drop('map_value') \
    .show()
Answer 1 (score: 1)
Using withColumn:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
dict = {'temp': '70.0', 'speed': '98', 'wind': 'TRUE'}
df = spark.createDataFrame([
    ('2019-05-10 7:30:05', 'device1', 'event', 'temp', None),
    ('2019-05-10 7:30:05', 'device2', 'sensor', 'speed', None),
    ('2019-05-10 7:30:05', 'device3', 'monitor', 'wind', None),
    ('2019-05-10 7:30:10', 'device1', 'event', 'temp', '75.2'),
    ('2019-05-10 7:30:10', 'device2', 'sensor', 'speed', '100'),
    ('2019-05-10 7:30:10', 'device3', 'monitor', 'wind', 'FALSE'),
], ['date', 'name', 'type', 'item', 'value'])
def replace_null(a, b):
    if b is None:
        return dict[a]
    else:
        return b
replace_null_udf = udf(replace_null, StringType())
df2 = df.withColumn("tmp", replace_null_udf(col("item"),col("value")))
df2.show()
+------------------+-------+-------+-----+-----+
| date| name| type| item| tmp|
+------------------+-------+-------+-----+-----+
|2019-05-10 7:30:05|device1| event| temp| 70.0|
|2019-05-10 7:30:05|device2| sensor|speed| 98|
|2019-05-10 7:30:05|device3|monitor| wind| TRUE|
|2019-05-10 7:30:10|device1| event| temp| 75.2|
|2019-05-10 7:30:10|device2| sensor|speed| 100|
|2019-05-10 7:30:10|device3|monitor| wind|FALSE|
+------------------+-------+-------+-----+-----+
df3 = df2.drop("value").withColumnRenamed('tmp','value')
df3.show()
+------------------+-------+-------+-----+-----+
| date| name| type| item|value|
+------------------+-------+-------+-----+-----+
|2019-05-10 7:30:05|device1| event| temp| 70.0|
|2019-05-10 7:30:05|device2| sensor|speed| 98|
|2019-05-10 7:30:05|device3|monitor| wind| TRUE|
|2019-05-10 7:30:10|device1| event| temp| 75.2|
|2019-05-10 7:30:10|device2| sensor|speed| 100|
|2019-05-10 7:30:10|device3|monitor| wind|FALSE|
+------------------+-------+-------+-----+-----+
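One caveat with this UDF: `dict[a]` raises `KeyError` for any item that has no mapping entry. A defensive variant (a sketch, not part of the original answer) falls back to `None` via `dict.get`:

```python
mapping = {'temp': '70.0', 'speed': '98', 'wind': 'TRUE'}

def replace_null_safe(item, value):
    # Keep the existing value when present; otherwise look the item up,
    # returning None (i.e. null) when it has no mapping entry instead
    # of raising KeyError.
    return value if value is not None else mapping.get(item)
```

Wrapped with `udf(replace_null_safe, StringType())`, it slots into `withColumn` exactly as in the answer above.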
Answer 2 (score: 1)
You could consider the following solution:
from pyspark.sql import functions as F

mapping = {'temp': '70.0', 'speed': '98', 'wind': 'TRUE'}
mappingDf = spark.createDataFrame(list(mapping.items()), ['item_t', 'value_t'])

df = df.join(mappingDf, df.item == mappingDf.item_t)
df = df.withColumn('value', F.when(F.col('value').isNotNull(), df.value)
                             .otherwise(df.value_t)) \
       .drop('item_t').drop('value_t')
df.show()
+------------------+-------+-------+-----+-----+
| date| name| type| item|value|
+------------------+-------+-------+-----+-----+
|2019-05-10 7:30:05|device1| event| temp| 70.0|
|2019-05-10 7:30:10|device1| event| temp| 75.2|
|2019-05-10 7:30:05|device3|monitor| wind| TRUE|
|2019-05-10 7:30:10|device3|monitor| wind|FALSE|
|2019-05-10 7:30:05|device2| sensor|speed| 98|
|2019-05-10 7:30:10|device2| sensor|speed| 100|
+------------------+-------+-------+-----+-----+
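Note that `join` defaults to an inner join here, so any row whose item is missing from the mapping dataframe would be silently dropped (the output row order also changes). The difference between the two join modes can be sketched per row in plain Python, with a hypothetical unmapped `humidity` row added for illustration:

```python
mapping = {'temp': '70.0', 'speed': '98', 'wind': 'TRUE'}
rows = [('temp', None), ('speed', '100'), ('humidity', None)]

# Inner-join behavior: unmapped items disappear from the result.
inner = [(item, value if value is not None else mapping[item])
         for item, value in rows if item in mapping]

# Left-join behavior (how='left'): unmapped rows survive, value stays null.
left = [(item, value if value is not None else mapping.get(item))
        for item, value in rows]
```

If unmapped items can occur in your data, pass `how='left'` to `join` to keep those rows.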