I am working with PySpark. I have a Spark dataframe in the following format:
| person_id | person_attributes                                                     |
|-----------|-----------------------------------------------------------------------|
| id_1      | "department=Sales__title=Sales_executive__level=junior"               |
| id_2      | "department=Engineering__title=Software Engineer__level=entry-level"  |
I wrote a Python function that takes person_id and person_attributes and returns JSON in the following format:

{"id_1":{"properties":[{"department":'Sales'},{"title":'Sales_executive'},{}]}}

But I don't know how to register it as a UDF in PySpark with the correct output type. Here is the Python code:
def create_json_from_string(pid, attribute_string):
    results = []
    attribute_map = {}
    output = {}
    # Split attribute_string into key,value pairs and store them in attribute_map
    if attribute_string != '':
        attribute_string = attribute_string.split("__")  # This will be a list
        for substring in attribute_string:
            k, v = substring.split("=")
            attribute_map[str(k)] = str(v)
    for k, v in attribute_map.items():
        temp = {k: v}
        results.append(temp)
    output = {pid: {"properties": results}}
    return output
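For reference, a quick sanity check of the function above on the question's first sample row (plain Python, no Spark needed):

```python
def create_json_from_string(pid, attribute_string):
    results = []
    attribute_map = {}
    # Split attribute_string into key,value pairs and store them in attribute_map
    if attribute_string != '':
        for substring in attribute_string.split("__"):
            k, v = substring.split("=")
            attribute_map[str(k)] = str(v)
    for k, v in attribute_map.items():
        results.append({k: v})
    return {pid: {"properties": results}}

print(create_json_from_string("id_1", "department=Sales__title=Sales_executive__level=junior"))
# {'id_1': {'properties': [{'department': 'Sales'}, {'title': 'Sales_executive'}, {'level': 'junior'}]}}
```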
Answer 0 (score: 2)
You need to modify the function so that it returns only the map of strings, without building the full structure. The function can then be applied to a single column rather than the whole row. Like this:
from pyspark.sql.types import MapType, StringType
from pyspark.sql.functions import col

def struct_from_string(attribute_string):
    attribute_map = {}
    if attribute_string != '':
        attribute_string = attribute_string.split("__")  # This will be a list
        for substring in attribute_string:
            k, v = substring.split("=")
            attribute_map[str(k)] = str(v)
    return attribute_map

my_parse_string_udf = spark.udf.register("my_parse_string", struct_from_string,
                                         MapType(StringType(), StringType()))
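The parsing logic itself is plain Python, so it can be sanity-checked without a Spark session. Note that it also handles the empty-string case, returning an empty map:

```python
def struct_from_string(attribute_string):
    attribute_map = {}
    if attribute_string != '':
        for substring in attribute_string.split("__"):
            k, v = substring.split("=")
            attribute_map[str(k)] = str(v)
    return attribute_map

print(struct_from_string("department=Sales__title=Sales_executive__level=junior"))
# {'department': 'Sales', 'title': 'Sales_executive', 'level': 'junior'}
print(struct_from_string(""))
# {}
```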
It can then be used as follows:
df2 = df.select(col("person_id"), my_parse_string_udf(col("person_attributes")))
Answer 1 (score: 1)
In Spark, a UDF is treated as a black box. If you would rather have a solution based on the dataframe API:

Spark 2.4+

Create the dataframe:
df = spark.createDataFrame(
    [('id_1', "department=Sales__title=Sales_executive__level=junior"),
     ('id_2', "department=Engineering__title=Software Engineer__level=entry-level")],
    ['person_id', 'person_attributes'])
df.show()
+---------+--------------------+
|person_id| person_attributes|
+---------+--------------------+
| id_1|department=Sales_...|
| id_2|department=Engine...|
+---------+--------------------+
Convert person_attributes to a map:

import pyspark.sql.functions as f

df2 = df.select('person_id',
                f.map_from_arrays(
                    f.expr('''transform(transform(split(person_attributes,'__'),x->split(x,'=')),y->y[0])'''),
                    f.expr('''transform(transform(split(person_attributes,'__'),x->split(x,'=')),y->y[1])''')).alias('value'))
df2.show(2, False)
+---------+-----------------------------------------------------------------------------+
|person_id|value |
+---------+-----------------------------------------------------------------------------+
|id_1 |[department -> Sales, title -> Sales_executive, level -> junior] |
|id_2 |[department -> Engineering, title -> Software Engineer, level -> entry-level]|
+---------+-----------------------------------------------------------------------------+
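The nested transform/split expression can be hard to read, so here is an equivalent in plain Python (for illustration only, using the question's sample string) showing what each step produces:

```python
s = "department=Sales__title=Sales_executive__level=junior"

pairs = [x.split("=") for x in s.split("__")]  # transform(split(s,'__'), x -> split(x,'='))
keys = [y[0] for y in pairs]                   # transform(..., y -> y[0])
values = [y[1] for y in pairs]                 # transform(..., y -> y[1])
value_map = dict(zip(keys, values))            # map_from_arrays(keys, values)

print(value_map)
# {'department': 'Sales', 'title': 'Sales_executive', 'level': 'junior'}
```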
Create the required structure:

data = df2.select(f.create_map('person_id', f.create_map(f.lit('properties'), 'value'))
                  .alias('json')).toJSON().collect()
data
['{"json":{"id_1":{"properties":{"department":"Sales","title":"Sales_executive","level":"junior"}}}}',
'{"json":{"id_2":{"properties":{"department":"Engineering","title":"Software Engineer","level":"entry-level"}}}}']
You can either collect it like this or keep working with the dataframe. If you collect it, each element is a JSON string that can be parsed:

import json

for i in data:
    d = json.loads(i)
    print(d['json'])
{'id_1': {'properties': {'department': 'Sales', 'title': 'Sales_executive', 'level': 'junior'}}}
{'id_2': {'properties': {'department': 'Engineering', 'title': 'Software Engineer', 'level': 'entry-level'}}}