PySpark: FloatType issue when writing data to a Parquet file

Time: 2020-09-07 11:51:09

Tags: apache-spark pyspark

I have the following schema:

root
 |-- A: string (nullable = true)
 |-- B: float (nullable = true)

When I apply the schema to the data, the values in the float column of the DataFrame no longer match the original values.

Original data:
("floadVal1", 0.404413386),
("floadVal2", 0.28563),
("floadVal3", 0.591290286),
("floadVal4", 0.404413386),
("floadVal5", 15.37610198),
("floadVal6", 15.261798303),
("floadVal7", 19.887814583),
("floadVal8", 0.0)

Please help me understand what exactly Spark is doing to produce the output below.

DataFrame after applying the schema:

+---------+----------+
|        A|         B|
+---------+----------+
|floadVal1|0.40441337|
|floadVal2|   0.28563|
|floadVal3| 0.5912903|
|floadVal4|0.40441337|
|floadVal5| 15.376102|
|floadVal6| 15.261798|
|floadVal7| 19.887815|
|floadVal8|       0.0|
+---------+----------+

After writing to Parquet:

           A          B
0  floadVal1   0.404413
1  floadVal2   0.285630
2  floadVal3   0.591290
3  floadVal4   0.404413
4  floadVal5  15.376102
5  floadVal6  15.261798
6  floadVal7  19.887815
7  floadVal8   0.000000
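
The table above looks like a pandas DataFrame printout, although the post does not say how the Parquet file was read back, so that is an assumption. If it is, the shorter values may be a display effect only: pandas prints floats with 6 digits after the decimal point by default. A minimal sketch in plain Python, formatting the values Spark showed to 6 decimal places:

```python
# Values exactly as printed by df.show() in the table further up.
spark_shown = [0.40441337, 0.28563, 0.5912903, 0.40441337,
               15.376102, 15.261798, 19.887815, 0.0]


def six_decimals(value):
    """Format a float with 6 digits after the decimal point."""
    return format(value, ".6f")


formatted = [six_decimals(v) for v in spark_shown]
for value, text in zip(spark_shown, formatted):
    print(value, "->", text)
```

The formatted strings match column B of the printout, which suggests the file holds the same values Spark displayed and only the number of printed digits differs.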

Also, according to the Spark 2.4.5 docs, FloatType represents 4-byte single-precision floating point numbers.
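
A 4-byte float carries a 24-bit significand, roughly 7 significant decimal digits, so a Python float (8 bytes) generally cannot be stored in a FloatType column without rounding. The sketch below reproduces that rounding outside Spark with the standard `struct` module; the helper name `to_float32` is hypothetical and is not part of PySpark.

```python
import struct


def to_float32(value):
    """Round a 64-bit Python float to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", value))[0]


original = [0.404413386, 0.28563, 0.591290286, 0.404413386,
            15.37610198, 15.261798303, 19.887814583, 0.0]

# What a 4-byte float can actually hold for each input value.
stored = [to_float32(v) for v in original]

for before, after in zip(original, stored):
    print(before, "->", repr(after), "| difference:", abs(before - after))
```

If the extra digits matter, declaring column B as DoubleType instead of FloatType keeps the full 8-byte precision of the input values.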

Sample code:

from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.types import StructType, StructField, StringType, FloatType

# Local session that writes Parquet in the legacy format
spark = SparkSession.builder.master('local').config(
    "spark.sql.parquet.writeLegacyFormat", 'true').getOrCreate()

# Not used below; kept from the original snippet
sc = spark.sparkContext
sqlContext = SQLContext(sc)

# Column B is declared as a 4-byte float
schema = StructType([
    StructField("A", StringType(), True),
    StructField("B", FloatType(), True)])

df = spark.createDataFrame([
    ("floadVal1", 0.404413386),
    ("floadVal2", 0.28563),
    ("floadVal3", 0.591290286),
    ("floadVal4", 0.404413386),
    ("floadVal5", 15.37610198),
    ("floadVal6", 15.261798303),
    ("floadVal7", 19.887814583),
    ("floadVal8", 0.0)
], schema)

df.printSchema()
df.show()
df.write.format("parquet").save('floatTestParFile')

0 answers:

No answers