Scala Spark: setting a schema produces duplicate columns

Asked: 2018-03-29 23:14:27

Tags: scala apache-spark dataframe spark-dataframe

I'm having trouble specifying a schema for my DataFrame. Without setting a schema, printSchema() produces:

root
 |-- Store: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- IsHoliday: string (nullable = true)
 |-- Dept: string (nullable = true)
 |-- Weekly_Sales: string (nullable = true)
 |-- Temperature: string (nullable = true)
 |-- Fuel_Price: string (nullable = true)
 |-- MarkDown1: string (nullable = true)
 |-- MarkDown2: string (nullable = true)
 |-- MarkDown3: string (nullable = true)
 |-- MarkDown4: string (nullable = true)
 |-- MarkDown5: string (nullable = true)
 |-- CPI: string (nullable = true)
 |-- Unemployment: string (nullable = true)

However, when I specify the schema with .schema(schema):

val dfr = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(schema)

printSchema() produces:

root
 |-- Store: integer (nullable = true)
 |-- Date: date (nullable = true)
 |-- IsHoliday: boolean (nullable = true)
 |-- Dept: integer (nullable = true)
 |-- Weekly_Sales: integer (nullable = true)
 |-- Temperature: double (nullable = true)
 |-- Fuel_Price: double (nullable = true)
 |-- MarkDown1: double (nullable = true)
 |-- MarkDown2: double (nullable = true)
 |-- MarkDown3: double (nullable = true)
 |-- MarkDown4: double (nullable = true)
 |-- MarkDown5: double (nullable = true)
 |-- CPI: double (nullable = true)
 |-- Unemployment: double (nullable = true)
 |-- Dept: integer (nullable = true)
 |-- Weekly_Sales: integer (nullable = true)
 |-- Temperature: double (nullable = true)
 |-- Fuel_Price: double (nullable = true)
 |-- MarkDown1: double (nullable = true)
 |-- MarkDown2: double (nullable = true)
 |-- MarkDown3: double (nullable = true)
 |-- MarkDown4: double (nullable = true)
 |-- MarkDown5: double (nullable = true)
 |-- CPI: double (nullable = true)
 |-- Unemployment: double (nullable = true)

The DataFrame itself contains all of these duplicate columns, and I'm not sure why.

My code:

import org.apache.spark.sql.types._

// Make a custom schema
val schema = StructType(Array(
       StructField("Store", IntegerType, true),
       StructField("Date", DateType, true),
       StructField("IsHoliday", BooleanType, true),
       StructField("Dept", IntegerType, true),
       StructField("Weekly_Sales", IntegerType, true),
       StructField("Temperature", DoubleType, true),
       StructField("Fuel_Price", DoubleType, true),
       StructField("MarkDown1", DoubleType, true),
       StructField("MarkDown2", DoubleType, true),
       StructField("MarkDown3", DoubleType, true),
       StructField("MarkDown4", DoubleType, true),
       StructField("MarkDown5", DoubleType, true),
       StructField("CPI", DoubleType, true),
       StructField("Unemployment", DoubleType, true)))

val dfr = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(schema)
val train_df = dfr.load("/FileStore/tables/train.csv")
val features_df = dfr.load("/FileStore/tables/features.csv")

// Combine the train and features
val data = train_df.join(features_df, Seq("Store", "Date", "IsHoliday"), "left")
data.show(5)
data.printSchema()

1 Answer:

Answer 0 (score: 1):

It is working as expected. After load(), train_df and features_df have identical columns: the 14 columns of schema, because both loads reuse the same reader dfr.
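You can check this directly (a quick sketch, using the DataFrames loaded above):

// Both loads reuse the same DataFrameReader, so both DataFrames
// carry the identical 14-column schema:
println(train_df.columns.sameElements(features_df.columns))  // true
println(train_df.columns.length)                             // 14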

As per your join condition, Seq("Store", "Date", "IsHoliday") takes those 3 columns from each DataFrame (3 + 3 = 6 columns in total), and the join merges them into a single set of column names (3 columns). The remaining columns, however, come from both sides: train_df contributes its remaining 11 columns and features_df contributes its remaining 11 columns.

So printSchema shows 25 columns (3 + 11 + 11).
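If you want each column name to appear only once, one option (a minimal sketch, assuming train.csv actually holds only the sales columns and features.csv only the market columns) is to select just the columns you need from each side before joining:

// Keep only the columns each file actually provides, so the joined
// result has 14 distinct columns instead of 25.
val trainCols = train_df.select("Store", "Date", "IsHoliday",
  "Dept", "Weekly_Sales")
val featureCols = features_df.select("Store", "Date", "IsHoliday",
  "Temperature", "Fuel_Price", "MarkDown1", "MarkDown2", "MarkDown3",
  "MarkDown4", "MarkDown5", "CPI", "Unemployment")

val deduped = trainCols.join(featureCols, Seq("Store", "Date", "IsHoliday"), "left")
deduped.printSchema()  // 3 key columns + 2 from train + 9 from features = 14

A cleaner long-term fix is to give each CSV its own schema matching its actual columns, rather than reusing one 14-column schema for both files.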