I have a pyspark dataframe with multiple columns (~30) of nested structs that I want to write to csv.
In order to do that, I want to stringify all of the struct columns.
I've checked several answers here:
Pyspark converting an array of struct into string
PySpark: DataFrame - Convert Struct to Array
PySpark convert struct field inside array to string
This is the structure of my dataframe (with ~30 complex keys):
root
|-- 1_simple_key: string (nullable = true)
|-- 2_simple_key: string (nullable = true)
|-- 3_complex_key: struct (nullable = true)
| |-- n1: string (nullable = true)
| |-- n2: struct (nullable = true)
| | |-- n3: boolean (nullable = true)
| | |-- n4: boolean (nullable = true)
| | |-- n5: boolean (nullable = true)
| |-- n6: long (nullable = true)
| |-- n7: long (nullable = true)
|-- 4_complex_key: struct (nullable = true)
| |-- n1: string (nullable = true)
| |-- n2: struct (nullable = true)
| | |-- n3: boolean (nullable = true)
| | |-- n4: boolean (nullable = true)
| | |-- n5: boolean (nullable = true)
| |-- n6: long (nullable = true)
| |-- n7: long (nullable = true)
|-- 5_complex_key: struct (nullable = true)
| |-- n1: string (nullable = true)
| |-- n2: struct (nullable = true)
| | |-- n3: boolean (nullable = true)
| | |-- n4: boolean (nullable = true)
| | |-- n5: boolean (nullable = true)
| |-- n6: long (nullable = true)
| |-- n7: long (nullable = true)
The suggested solutions are for a single column, and I can't apply them to multiple columns.
I want to do something like this (see the sketch below):
1. for each struct_column:
2.     col = stringify(struct_column)
I don't mind creating an additional dataframe for it. I just need it ready for csv writing.
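In other words, something like this sketch is what I'm after (assuming to_json counts as an acceptable "stringify" and that struct columns can be identified from df.dtypes; `df` here stands for the dataframe with the schema above):
from pyspark.sql import functions as F
#sketch: find all struct-typed columns and replace each one with its
#JSON string representation, leaving the simple columns untouched
struct_cols = [name for name, dtype in df.dtypes if dtype.startswith('struct')]
stringified = df.select(
    [F.to_json(c).alias(c) if c in struct_cols else F.col(c) for c in df.columns]
)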
Minimal reproducible example:
import pandas as pd
from pyspark.sql import Row

#assumes a running SparkSession named `spark` (e.g. the pyspark shell)
d = {'1_complex_key': {0: Row(type='1_complex_key', s=Row(n1=False, n2=False, n3=True), x=954, y=238), 1: Row(type='1_complex_key', s=Row(n1=False, n2=False, n3=True), x=956, y=250), 2: Row(type='1_complex_key', s=Row(n1=True, n2=False, n3=False), x=886, y=269)},
     '2_complex_key': {0: Row(type='2_complex_key', s=Row(n1=False, n2=False, n3=True), x=901, y=235), 1: Row(type='2_complex_key', s=Row(n1=False, n2=False, n3=True), x=905, y=249), 2: Row(type='2_complex_key', s=Row(n1=False, n2=False, n3=True), x=868, y=270)},
     '3_complex_key': {0: Row(type='3_complex_key', s=Row(n1=True, n2=False, n3=False), x=925, y=197), 1: Row(type='3_complex_key', s=Row(n1=False, n2=False, n3=True), x=928, y=206), 2: Row(type='3_complex_key', s=Row(n1=False, n2=False, n3=True), x=883, y=236)}}
df = pd.DataFrame.from_dict(d)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
s_df = spark.createDataFrame(df)
s_df.printSchema()
s_df.write.csv('it_doesnt_write.csv')
So, to sum up: I have a spark dataframe that I want to write to CSV. I can't write it to CSV because:
'CSV data source does not support struct<s:struct<n1:boolean,n2:boolean,n3:boolean>,type:string,x:bigint,y:bigint> data type.;'
Therefore, I'd like to perform some operation / reversible transformation on this dataframe so that I can write it to CSV, later read it back from the CSV, and end up with a Spark dataframe with the same schema.
How can I do this? Thanks
Answer 0 (score: 0)
As pault has already mentioned in the comments, you need a list comprehension. Such a list comprehension requires a list of columns and a function that converts those columns to strings. I will use df.columns
and to_json, but you can also provide your own python list of column names and a custom function to stringify your complex columns.
from pyspark.sql import functions as F

#this converts all columns to json strings
#and writes them to disk
s_df.select([F.to_json(x) for x in s_df.columns]).coalesce(1).write.csv('/tmp/testcsv')
In case you don't want to apply to_json to all columns, you can simply modify it like this:
list4tojson = ['2_complex_key', '3_complex_key']
s_df.select('1_complex_key', *[F.to_json(x) for x in list4tojson]).coalesce(1).write.csv('/tmp/testcsv')
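Note that F.to_json(x) yields columns named to_json(x). If you would rather keep the original column names (plus a header row) for the round trip, a small variation of the same line would be:
s_df.select([F.to_json(x).alias(x) for x in s_df.columns]).coalesce(1).write.csv('/tmp/testcsv', header=True)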
You can restore the dataframe with from_json:
df = spark.read.csv('/tmp/testcsv')
df.printSchema()
#root
# |-- _c0: string (nullable = true)
# |-- _c1: string (nullable = true)
# |-- _c2: string (nullable = true)
#inferring the schema from the first JSON column
json_schema = spark.read.json(df.rdd.map(lambda row: row._c0)).schema
df.select([F.from_json(x, json_schema) for x in df.columns]).printSchema()
#root
# |-- jsontostructs(_c0): struct (nullable = true)
# | |-- s: struct (nullable = true)
# | | |-- n1: boolean (nullable = true)
# | | |-- n2: boolean (nullable = true)
# | | |-- n3: boolean (nullable = true)
# | |-- type: string (nullable = true)
# | |-- x: long (nullable = true)
# | |-- y: long (nullable = true)
# |-- jsontostructs(_c1): struct (nullable = true)
# | |-- s: struct (nullable = true)
# | | |-- n1: boolean (nullable = true)
# | | |-- n2: boolean (nullable = true)
# | | |-- n3: boolean (nullable = true)
# | |-- type: string (nullable = true)
# | |-- x: long (nullable = true)
# | |-- y: long (nullable = true)
# |-- jsontostructs(_c2): struct (nullable = true)
# | |-- s: struct (nullable = true)
# | | |-- n1: boolean (nullable = true)
# | | |-- n2: boolean (nullable = true)
# | | |-- n3: boolean (nullable = true)
# | |-- type: string (nullable = true)
# | |-- x: long (nullable = true)
# | |-- y: long (nullable = true)
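Note that the snippet above infers the schema from _c0 only and reuses it for every column, which works here because all three columns happen to share the same structure. If your ~30 complex columns differ from each other, a per-column variant (a sketch, using the same test paths) could infer each schema separately:
df = spark.read.csv('/tmp/testcsv')
#infer a schema for each JSON column independently, then parse it back;
#the c=c default binds the current column name inside the lambda
parsed = []
for c in df.columns:
    col_schema = spark.read.json(df.rdd.map(lambda row, c=c: row[c])).schema
    parsed.append(F.from_json(c, col_schema).alias(c))
df.select(parsed).printSchema()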
In case you simply want to store your data in a readable format, you can avoid all of the above code by writing it to json directly:
s_df.coalesce(1).write.json('/tmp/testjson')
df = spark.read.json('/tmp/testjson')
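One caveat: JSON schema inference may order struct fields alphabetically, so the restored schema can differ from the original in field order, though not in content.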