DecimalType problem - of class java.lang.String

Asked: 2016-09-24 08:58:24

Tags: scala, apache-spark

I am using Spark 1.6.1 with the bundled Scala 2.10.5. I am examining some weather data, some of which contains decimal values. Here is the code:

 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.implicits._

 import org.apache.spark.sql.Row
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql._
 import org.apache.spark.sql.types._
 import org.apache.spark.sql.SQLContext

val rawData = sc.textFile("Example_Weather.csv").map(_.split(","))

val header = rawData.first

val rawDataNoHeader = rawData.filter(_(0) != header(0))

rawDataNoHeader.first

object schema {
  val weatherdata = StructType(Seq(
    StructField("date", StringType, true),
    StructField("Region", StringType, true),
    StructField("Temperature", DecimalType(32,16), true),
    StructField("Solar", IntegerType, true),
    StructField("Rainfall", DecimalType(32,16), true),
    StructField("WindSpeed", DecimalType(32,16), true))
  )
}

val dataDF = sqlContext.createDataFrame(rawDataNoHeader.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5))), schema.weatherdata)

dataDF.registerTempTable("weatherdataSQL")

val datasql = sqlContext.sql("SELECT * FROM weatherdataSQL")

datasql.collect().foreach(println)

When I run the code, I get the expected output for the schema and from the sqlContext:

scala> object schema {
 | val weatherdata= StructType(Seq(
 | StructField("date", StringType, true),
 | StructField("Region", StringType, true),
 | StructField("Temperature", DecimalType(32,16), true),
 | StructField("Solar", IntegerType, true),
 | StructField("Rainfall", DecimalType(32,16), true),
 | StructField("WindSpeed", DecimalType(32,16), true))
 | )
 | }
16/09/24 09:40:58 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:56288 in memory (size: 4.6 KB, free: 511.1 MB)
16/09/24 09:40:58 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:39349 in memory (size: 4.6 KB, free: 2.7 GB)
16/09/24 09:40:58 INFO ContextCleaner: Cleaned accumulator 2
16/09/24 09:40:58 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost in memory (size: 1964.0 B, free: 511.1 MB)
16/09/24 09:40:58 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:41412 in memory (size: 1964.0 B, free: 2.7 GB)
16/09/24 09:40:58 INFO ContextCleaner: Cleaned accumulator 1
defined module schema

scala> val dataDF=sqlContext.createDataFrame(rawDataNoHeader.map(p=>Row(p(0),p(1),p(2),p(3),p(4),p(5))), schema.weatherdata)
dataDF: org.apache.spark.sql.DataFrame = [date: string, Region: string, Temperature: decimal(32,16), Solar: int, Rainfall: decimal(32,16), WindSpeed: decimal(32,16)]

However, the last line of code produces the following:

16/09/24 09:41:03 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, localhost): scala.MatchError: 20.21666667 (of class java.lang.String)

The number 20.21666667 is indeed the first temperature observed for a particular geographic region. I thought I had successfully specified Temperature as DecimalType(32,16). Is there a problem with my code, or with the sqlContext I am calling?

Following the suggestions, I changed dataDF to:

val dataDF = sqlContext.createDataFrame(rawDataNoHeader.map(p => Row(p(0), p(1), BigDecimal(p(2)), p(3), BigDecimal(p(4)), BigDecimal(p(5)))), schema.weatherdata)

Unfortunately, I now run into a casting problem:

16/09/24 10:31:35 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer

3 Answers:

Answer 0 (score: 1)

The code in your first edit is almost correct - p(3) must be converted with toInt.

I created a sample csv file without a header:

2016,a,201.222,12,12.1,5.0
2016,b,200.222,13,12.3,6.0
2014,b,200.111,14,12.3,7.0

The result:

val dataDF = sqlContext.createDataFrame(rawData.map(p => Row(p(0), p(1), BigDecimal(p(2)), p(3).toInt, BigDecimal(p(4)), BigDecimal(p(5)))), schema.weatherdata)

dataDF.show
+----+------+--------------------+-----+-------------------+------------------+
|date|Region|         Temperature|Solar|           Rainfall|         WindSpeed|
+----+------+--------------------+-----+-------------------+------------------+
|2016|     a|201.2220000000000000|   12|12.1000000000000000|5.0000000000000000|
|2016|     b|200.2220000000000000|   13|12.3000000000000000|6.0000000000000000|
|2014|     b|200.1110000000000000|   14|12.3000000000000000|7.0000000000000000|
+----+------+--------------------+-----+-------------------+------------------+
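
As a side note, if the real weather file could contain blanks or malformed numbers, a defensive variant of the same idea might be useful. This is a minimal sketch with a hypothetical parseRow helper, reusing the schema object and the Row/BigDecimal types already in scope from the question, and assuming rows that fail to parse should simply be dropped:

import scala.util.Try

// build a Row only if every numeric field parses, None otherwise
def parseRow(p: Array[String]): Option[Row] =
  Try(Row(p(0), p(1), BigDecimal(p(2)), p(3).toInt, BigDecimal(p(4)), BigDecimal(p(5)))).toOption

// flatMap silently drops the rows that came back as None
val safeDF = sqlContext.createDataFrame(rawData.flatMap(p => parseRow(p)), schema.weatherdata)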

Answer 1 (score: 0)

This is probably because you are reading the data from a .csv file: by default, everything is read in "Text/String" format. You can solve this in two ways:

1. Change the data type of the Temperature attribute in the .csv file itself.
2. Convert the string in code: val temperatureInDecimal = BigDecimal("20.21666667")

If you want to future-proof your application, I would suggest the second approach, since the .csv file may change.
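
To see why the conversion matters, you can check the runtime classes directly. A quick sketch against the question's rawDataNoHeader, with sample as a hypothetical name:

val sample = rawDataNoHeader.first        // Array[String] straight from the csv
println(sample(2).getClass)               // class java.lang.String - the class the MatchError complains about
println(BigDecimal(sample(2)).getClass)   // class scala.math.BigDecimal - a class DecimalType accepts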

Answer 2 (score: 0)

Since you know the expected schema up front, it is better to skip the manual parsing and use a proper input format. For Spark 1.6 / Scala 2.10, include the spark-csv package (--packages com.databricks:spark-csv_2.10:1.4.0) and:

val sqlContext: SQLContext = ???
val path: String = ???

sqlContext.read
  .format("csv")
  .schema(schema.weatherdata).option("header", "true")
  .load(path)
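
(If the short name csv is not recognized by your version of the package, the fully qualified source name com.databricks.spark.csv should work as well.)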

For 2.0+:

val spark: SparkSession = ???
val path: String = ???

spark.read
  .format("csv")
  .schema(schema.weatherdata).option("header", "true")
  .load(path)
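
As a side note, if you would rather not maintain the schema by hand, both the spark-csv package and the built-in 2.0 reader accept an inferSchema option. A minimal sketch, with the caveat that inference gives the numeric columns DoubleType rather than DecimalType(32,16):

spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")  // numeric columns are inferred as double, not decimal(32,16)
  .load(path)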