Question

After running describe on a table in Redshift, I get the following structure (all fields are nullable):

a integer
b numeric(18)
c date
d char(3)
e smallint
f char(1)
g varchar(20)
h numeric(11,2)

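All of the data has been extracted to S3. I now want to load it into a Spark DataFrame, but I also need to create an appropriate schema for this table.

What should the Spark schema for these fields look like?

Is the structure below correct? (I am especially unsure about the numeric(11,2), date, and char(1) fields.)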

val schema = StructType( 
    Array( 
        StructField("a", IntegerType, true), 
        StructField("b", IntegerType, true), 
        StructField("c", StringType, true),
        StructField("d", StringType, true),
        StructField("e", IntegerType, true),
        StructField("f", StringType, true),
        StructField("g", StringType, true),
        StructField("h", IntegerType, true)
    ) 
)

Answer 1

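You should use:

DoubleType or DecimalType for fractional values such as NUMERIC(11,2). In my opinion DecimalType is the better choice, because it is backed by BigDecimal and keeps the exact precision and scale, whereas DoubleType is a binary floating-point approximation. IntegerType would drop the two decimal places.
LongType for very large numbers such as NUMERIC(18). An 18-digit value does not fit into the 32-bit IntegerType, so it would not be stored correctly otherwise.
DateType for dates. They can be stored as strings, but if you can, you should pick the more meaningful type.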
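
Putting these suggestions together, the schema from the question could be sketched as follows. This is only an illustration under a few assumptions: ShortType for the smallint column is my own choice (the IntegerType from the question would also work), and the char columns are kept as StringType as in the question.

```scala
import org.apache.spark.sql.types._

val schema = StructType(
  Array(
    StructField("a", IntegerType, nullable = true),        // integer
    StructField("b", LongType, nullable = true),           // numeric(18): 18 digits fit in a 64-bit long
    StructField("c", DateType, nullable = true),           // date
    StructField("d", StringType, nullable = true),         // char(3)
    StructField("e", ShortType, nullable = true),          // smallint (assumption: IntegerType also works)
    StructField("f", StringType, nullable = true),         // char(1)
    StructField("g", StringType, nullable = true),         // varchar(20)
    StructField("h", DecimalType(11, 2), nullable = true)  // numeric(11,2): exact precision and scale
  )
)
```

The schema can then be handed to the reader with spark.read.schema(schema). The file format, and any date format option needed to parse column c, depend on how the data was unloaded to S3.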

