I'm having trouble specifying the schema for my DataFrame. Without setting a schema, printSchema() produces:
root
|-- Store: string (nullable = true)
|-- Date: string (nullable = true)
|-- IsHoliday: string (nullable = true)
|-- Dept: string (nullable = true)
|-- Weekly_Sales: string (nullable = true)
|-- Temperature: string (nullable = true)
|-- Fuel_Price: string (nullable = true)
|-- MarkDown1: string (nullable = true)
|-- MarkDown2: string (nullable = true)
|-- MarkDown3: string (nullable = true)
|-- MarkDown4: string (nullable = true)
|-- MarkDown5: string (nullable = true)
|-- CPI: string (nullable = true)
|-- Unemployment: string (nullable = true)
However, when I specify the schema with .schema(schema):
val dfr = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(schema)
my printSchema() produces:
root
|-- Store: integer (nullable = true)
|-- Date: date (nullable = true)
|-- IsHoliday: boolean (nullable = true)
|-- Dept: integer (nullable = true)
|-- Weekly_Sales: integer (nullable = true)
|-- Temperature: double (nullable = true)
|-- Fuel_Price: double (nullable = true)
|-- MarkDown1: double (nullable = true)
|-- MarkDown2: double (nullable = true)
|-- MarkDown3: double (nullable = true)
|-- MarkDown4: double (nullable = true)
|-- MarkDown5: double (nullable = true)
|-- CPI: double (nullable = true)
|-- Unemployment: double (nullable = true)
|-- Dept: integer (nullable = true)
|-- Weekly_Sales: integer (nullable = true)
|-- Temperature: double (nullable = true)
|-- Fuel_Price: double (nullable = true)
|-- MarkDown1: double (nullable = true)
|-- MarkDown2: double (nullable = true)
|-- MarkDown3: double (nullable = true)
|-- MarkDown4: double (nullable = true)
|-- MarkDown5: double (nullable = true)
|-- CPI: double (nullable = true)
|-- Unemployment: double (nullable = true)
The DataFrame itself contains all of these duplicate columns, and I'm not sure why.
My code:
// Make custom schema
val schema = StructType(Array(
StructField("Store", IntegerType, true),
StructField("Date", DateType, true),
StructField("IsHoliday", BooleanType, true),
StructField("Dept", IntegerType, true),
StructField("Weekly_Sales", IntegerType, true),
StructField("Temperature", DoubleType, true),
StructField("Fuel_Price", DoubleType, true),
StructField("MarkDown1", DoubleType, true),
StructField("MarkDown2", DoubleType, true),
StructField("MarkDown3", DoubleType, true),
StructField("MarkDown4", DoubleType, true),
StructField("MarkDown5", DoubleType, true),
StructField("CPI", DoubleType, true),
StructField("Unemployment", DoubleType, true)))
val dfr = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(schema)
val train_df = dfr.load("/FileStore/tables/train.csv")
val features_df = dfr.load("/FileStore/tables/features.csv")
// Combine the train and features
val data = train_df.join(features_df, Seq("Store", "Date", "IsHoliday"), "left")
data.show(5)
data.printSchema()
Answer 0 (score: 1)
It works as expected. After load(), train_df and features_df have identical columns, because the same 14-column schema was applied to both reads.
Your join condition Seq("Store", "Date", "IsHoliday") takes these 3 columns from each DataFrame (3 + 3 = 6 columns in total) and merges them into a single set of 3 columns. The remaining columns, however, survive from both sides: 11 remaining columns from train_df plus 11 remaining columns from features_df.
Hence printSchema() shows 25 columns (3 + 11 + 11).
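One way to avoid the duplicates is to remove the overlapping non-key columns from features_df before joining. This is only a sketch: it assumes features.csv really does share those non-key columns with train.csv because the same schema was applied to both, and that you want to keep train_df's copy of each. (Defining a separate, correct schema for features.csv would be the cleaner fix if that file actually has different columns.)

```scala
// Columns present in both DataFrames, minus the join keys.
val overlap = train_df.columns
  .intersect(features_df.columns)
  .diff(Seq("Store", "Date", "IsHoliday"))

// Drop each overlapping column from features_df one at a time
// (foldLeft with the single-column drop works on older Spark 1.x,
// which this sqlContext/spark-csv code appears to target).
val featuresTrimmed = overlap.foldLeft(features_df)((df, c) => df.drop(c))

val data = train_df.join(featuresTrimmed, Seq("Store", "Date", "IsHoliday"), "left")
data.printSchema()  // 14 columns: 3 keys + 11 from train_df
```

On Spark 2.0+ you could replace the foldLeft with the vararg form `features_df.drop(overlap: _*)`.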