Here is the code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName('Sample').config("spark.some.config.option", "some-value").getOrCreate()
schema = StructType([
    StructField("WID", StringType(), True),
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("WDomain", StringType(), True)
])
rows = [
    ["1", "Jack", 22, "Data Science"],
    ["2", "Luke", 21, "Business Analytics"],
    ["3", "Leo", 24, "Web Apps"],
    ["4", "Mark", 21, "Business Analytics"]
]
df = spark.createDataFrame(data=rows, schema=schema)
df1 = df.describe("Age")
df1.show()
df1.write.parquet("Age")
df2 = df.select("WID","Name","Age").sort(desc("Name"))
df2.show()
df2.write.parquet("NameSorted")
df3 = spark.read.parquet("dbfs:/Age")
df4 = spark.read.parquet("dbfs:/NameSorted")
df3.show()
df4.show()
When I display the DataFrames with show(), the rows appear in the expected order. But after I write them to Parquet files and read them back, PySpark returns the rows in a different order. Please help me write the DataFrame to Parquet in a way that preserves the row order. Thanks in advance.
Output of df1.show() (before writing the DataFrame):
+-------+-----------------+
|summary| Age|
+-------+-----------------+
| count| 4|
| mean| 22.0|
| stddev|1.414213562373095|
| min| 21|
| max| 24|
+-------+-----------------+
Whereas for df3.show() (the DataFrame read back from the Parquet file, after writing):
+-------+-----------------+
|summary| Age|
+-------+-----------------+
| stddev|1.414213562373095|
| mean| 22.0|
| count| 4|
| min| 21|
| max| 24|
+-------+-----------------+