One of my DataFrames (spark.sql) has this schema:
root
|-- ValueA: string (nullable = true)
|-- ValueB: struct (nullable = true)
| |-- abc: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- a0: string (nullable = true)
| | | |-- a1: string (nullable = true)
| | | |-- a2: string (nullable = true)
| | | |-- a3: string (nullable = true)
|-- ValueC: struct (nullable = true)
| |-- pqr: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- info1: string (nullable = true)
| | | |-- info2: struct (nullable = true)
| | | | |-- x1: long (nullable = true)
| | | | |-- x2: long (nullable = true)
| | | | |-- x3: string (nullable = true)
| | | |-- info3: string (nullable = true)
| | | |-- info4: string (nullable = true)
|-- Value4: struct (nullable = true)
| |-- xyz: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- b0: string (nullable = true)
| | | |-- b2: string (nullable = true)
| | | |-- b3: string (nullable = true)
|-- Value5: string (nullable = true)
I need to save it to a CSV file, but without any flattening or exploding, in the following format:
|-- ValueA: string (nullable = true)
|-- ValueB: struct (nullable = true)
|-- ValueC: struct (nullable = true)
|-- ValueD: struct (nullable = true)
|-- ValueE: string (nullable = true)
I have already achieved this directly with the command df.toPandas().to_csv("output.csv"), but I need a better way. I am using PySpark.
Answer 0 (score: 1)
In Spark, the csv format does not yet support writing complex types (struct/array, etc.).

A better approach in Spark is to write in parquet format, since parquet supports all nested data types and gives better performance when reading and writing.

Write as Parquet file:
df.write.parquet("<path>")
If writing in json format is acceptable:

Write as Json file:
df.write.json("path")
#or
df.toJSON().saveAsTextFile("path")
Write as CSV file:

Use the to_json function to convert the struct/array columns to json strings, then store the result in csv format.