I have the following schema:
root
|-- A: string (nullable = true)
|-- B: float (nullable = true)
When I apply this schema to the data, the values in the float column of the DataFrame come out wrong.
Original Data :-
("floadVal1", 0.404413386),
("floadVal2", 0.28563),
("floadVal3", 0.591290286),
("floadVal4", 0.404413386),
("floadVal5", 15.37610198),
("floadVal6", 15.261798303),
("floadVal7", 19.887814583),
("floadVal8", 0.0)
Please help me understand what exactly Spark is doing to produce the output below.
After Applying Schema Dataframe:-
+---------+----------+
|        A|         B|
+---------+----------+
|floadVal1|0.40441337|
|floadVal2|   0.28563|
|floadVal3| 0.5912903|
|floadVal4|0.40441337|
|floadVal5| 15.376102|
|floadVal6| 15.261798|
|floadVal7| 19.887815|
|floadVal8|       0.0|
+---------+----------+
After writing to Parquet:
           A          B
0  floadVal1   0.404413
1  floadVal2   0.285630
2  floadVal3   0.591290
3  floadVal4   0.404413
4  floadVal5  15.376102
5  floadVal6  15.261798
6  floadVal7  19.887815
7  floadVal8   0.000000
Also, according to the Spark 2.4.5 documentation, FloatType represents 4-byte single-precision floating point numbers.
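To make the "4-byte single-precision" point concrete, the rounding can be reproduced without Spark. This is only a minimal sketch: it assumes Python's standard `struct` module is a fair stand-in for how a value is stored in a 4-byte float, and the helper names `to_float32` and `shortest_repr` are hypothetical ones introduced here for illustration. The digit search approximates what a float-to-string conversion would print; it is not Spark's actual formatting code.

```python
import struct

def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", x))[0]

def shortest_repr(x32: float) -> str:
    """Fewest significant digits that still identify the same 32-bit float."""
    # 9 significant digits are always enough to round-trip a 32-bit float.
    for digits in range(1, 10):
        text = "%.*g" % (digits, x32)
        if to_float32(float(text)) == x32:
            return text
    return repr(x32)

# A few of the values from the original data above.
for value in (0.404413386, 0.28563, 0.591290286, 15.37610198):
    x32 = to_float32(value)
    print(value, "->", repr(x32), "->", shortest_repr(x32))
```

If the assumption holds, the last column should line up with what `df.show()` prints for column B, which would suggest the values are being rounded to the nearest representable 4-byte float rather than corrupted.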
Sample Code
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.types import StructType, StructField, StringType, FloatType

# Local session, with Parquet legacy write format enabled
spark = SparkSession.builder.master('local').config(
    "spark.sql.parquet.writeLegacyFormat", 'true').getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)

# Column B is declared as FloatType (4-byte single precision)
schema = StructType([
    StructField("A", StringType(), True),
    StructField("B", FloatType(), True)])

df = spark.createDataFrame([
    ("floadVal1", 0.404413386),
    ("floadVal2", 0.28563),
    ("floadVal3", 0.591290286),
    ("floadVal4", 0.404413386),
    ("floadVal5", 15.37610198),
    ("floadVal6", 15.261798303),
    ("floadVal7", 19.887814583),
    ("floadVal8", 0.0)
], schema)

df.printSchema()
df.show()
df.write.format("parquet").save('floatTestParFile')