Is there a way to write a PySpark DataFrame to a Parquet file without the rows being automatically reordered?

Asked: 2021-02-26 10:35:16

Tags: python dataframe apache-spark pyspark

Here is the code:

from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import desc


spark = SparkSession.builder.appName('Sample').config("spark.some.config.option", "some-value").getOrCreate() 
sqlContext = SQLContext(spark) 
schema = StructType([ 
    StructField("WID",StringType(),True), 
    StructField("Name",StringType(),True), 
    StructField("Age",IntegerType(),True), 
    StructField("WDomain",StringType(),True) 
])


rows = [ 
  ["1","Jack",22,"Data Science"], 
  ["2","Luke",21,"Business Analytics"], 
  ["3","Leo",24,"Web Apps"], 
  ["4","Mark",21,"Business Analytics"] 
]

df = spark.createDataFrame(data = rows,schema=schema) 

df1 = df.describe("Age") 
df1.show() 

df1.write.parquet("Age") 

df2 = df.select("WID","Name","Age").sort(desc("Name")) 
df2.show() 

df2.write.parquet("NameSorted") 

df3 = spark.read.parquet("dbfs:/Age") 
df4 = spark.read.parquet("dbfs:/NameSorted") 


df3.show() 
df4.show()

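When I display the DataFrame, the rows appear in the expected order, but after I write it to a Parquet file and read it back, the rows come back in a different order. How can I write the DataFrame to Parquet so that the row order is preserved? Thanks in advance.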

Output of df1.show() (before writing the DataFrame):

+-------+-----------------+
|summary|              Age|
+-------+-----------------+
|  count|                4|
|   mean|             22.0|
| stddev|1.414213562373095|
|    min|               21|
|    max|               24|
+-------+-----------------+

Whereas for df3.show() (the DataFrame read back from the Parquet file, after writing):

+-------+-----------------+
|summary|              Age|
+-------+-----------------+
| stddev|1.414213562373095|
|   mean|             22.0|
|  count|                4|
|    min|               21|
|    max|               24|
+-------+-----------------+

0 answers:

No answers