One of my DataFrames (spark.sql) has this schema:
root
|-- ValueA: string (nullable = true)
|-- ValueB: struct (nullable = true)
| |-- abc: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- a0: string (nullable = true)
| | | |-- a1: string (nullable = true)
| | | |-- a2: string (nullable = true)
| | | |-- a3: string (nullable = true)
|-- ValueC: struct (nullable = true)
| |-- pqr: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- info1: string (nullable = true)
| | | |-- info2: struct (nullable = true)
| | | | |-- x1: long (nullable = true)
| | | | |-- x2: long (nullable = true)
| | | | |-- x3: string (nullable = true)
| | | |-- info3: string (nullable = true)
| | | |-- info4: string (nullable = true)
|-- Value4: struct (nullable = true)
| |-- xyz: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- b0: string (nullable = true)
| | | |-- b2: string (nullable = true)
| | | |-- b3: string (nullable = true)
|-- Value5: string (nullable = true)
I need to save it to a CSV file, but without any flattening or exploding, in the following format:
|-- ValueA: string (nullable = true)
|-- ValueB: struct (nullable = true)
|-- ValueC: struct (nullable = true)
|-- ValueD: struct (nullable = true)
|-- ValueE: string (nullable = true)
I have already achieved this directly with the command df.toPandas().to_csv("output.csv"), but I need a better way. I am using PySpark.
Answer 0 (score: 1)
In Spark, the csv format does not yet support writing complex types (struct/array, etc.).

A better approach in Spark is to write in parquet format, since parquet supports all nested data types and gives better performance when reading and writing.

Write as Parquet file:
df.write.parquet("<path>")
If writing in json format is acceptable:

Write as Json file:
df.write.json("path")
#or
df.toJSON().saveAsTextFile("path")
Write as CSV file:

Use the to_json function to convert the struct/array columns to json strings, then store the result in csv format.