How to find the average closing price per year in Spark?

Asked: 2018-02-01 14:50:20

Tags: scala apache-spark

How do I find the average closing price per year in Spark? My attempt:

sqlContext.sql(
  "select avg(closeprice), year(dt) from df0 group by year(dt)"
).show

The difficulty is that the "dt" column is StringType, and I can't cast a StringType column to year/month/day. Please comment. The schema is as follows:
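(For context, a minimal plain-Scala sketch, no Spark involved: since the "dt" strings are already in ISO yyyy-MM-dd form, `java.time.LocalDate` can parse them and expose the year directly. The `yearOf` helper name is illustrative, not from the original post.)

```scala
import java.time.LocalDate

// Parse an ISO yyyy-MM-dd string and return its year component.
def yearOf(dt: String): Int = LocalDate.parse(dt).getYear

println(yearOf("1950-01-03")) // prints 1950
```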

val customSchema = StructType(Array(
            StructField("dt", StringType, true),
            StructField("openprice", DoubleType, true),
            StructField("highprice", DoubleType, true),
            StructField("lowprice", DoubleType, true),
            StructField("closeprice", DoubleType, true),
            StructField("volume", IntegerType, true)
            ))

val df0 = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", ",")
  .option("header", "true")
  .schema(customSchema)
  .load("./data/GSPC.csv")

df0.show()
df0.printSchema()

+----------+---------+---------+---------+----------+-------+
|        dt|openprice|highprice| lowprice|closeprice| volume|
+----------+---------+---------+---------+----------+-------+
|1950-01-03|    16.66|    16.66|    16.66|     16.66|1260000|
|1950-01-04|    16.85|    16.85|    16.85|     16.85|1890000|
|1950-01-05|    16.93|    16.93|    16.93|     16.93|2550000|
|1950-01-06|    16.98|    16.98|    16.98|     16.98|2010000|
|1950-01-09|    17.09|    17.09|    17.08|     17.08|3850000|
|1950-01-10|17.030001|17.030001|17.030001| 17.030001|2160000|
|1950-01-11|    17.09|    17.09|    17.09|     17.09|2630000|
|1950-01-12|    16.76|    16.76|    16.76|     16.76|2970000|
|1950-01-13|    16.67|    16.67|    16.67|     16.67|3330000|
|1950-01-16|    16.65|16.719999|    16.65| 16.719999|2640000|
|1950-01-17|16.860001|16.860001|16.860001| 16.860001|1790000|
|1950-01-18|    16.85|    16.85|    16.85|     16.85|1570000|
|1950-01-19|16.870001|16.870001|16.870001| 16.870001|1170000|
|1950-01-20|     16.9|     16.9|     16.9|      16.9|1440000|
|1950-01-23|16.940001|16.940001|    16.92|     16.92|1890000|
|1950-01-24|16.860001|16.860001|16.860001| 16.860001|1250000|
|1950-01-25|    16.74|    16.74|    16.74|     16.74|1700000|
|1950-01-26|    16.73|    16.73|    16.73|     16.73|1150000|
|1950-01-27|    16.82|    16.82|    16.82|     16.82|1250000|
|1950-01-30|     16.9|    17.02|     16.9|     17.02|2380000|
+----------+---------+---------+---------+----------+-------+

root
 |-- dt: string (nullable = true)
 |-- openprice: double (nullable = true)
 |-- highprice: double (nullable = true)
 |-- lowprice: double (nullable = true)
 |-- closeprice: double (nullable = true)
 |-- volume: integer (nullable = true)

2 answers:

Answer 0 (score: 1)

In recent Spark versions, use to_date:

scala> spark.sql("SELECT YEAR(TO_DATE('3-1-1959', 'dd-MM-yyyy'))").show
+---------------------------------------+
|year(to_date('3-1-1959', 'dd-MM-yyyy'))|
+---------------------------------------+
|                                   1959|
+---------------------------------------+

For older versions:

scala> spark.sql("SELECT YEAR(CAST(UNIX_TIMESTAMP('3-1-1959', 'dd-MM-yyyy') AS TIMESTAMP))").show
+---------------------------------------------------------------------------+
|year(CAST(CAST(unix_timestamp(3-1-1959, dd-MM-yyyy) AS TIMESTAMP) AS DATE))|
+---------------------------------------------------------------------------+
|                                                                       1959|
+---------------------------------------------------------------------------+

Answer 1 (score: 0)

Thanks for the help. I prefer pure SQL, since I'm an Oracle DBA and write a lot of SQL. Anyway, here is the answer describing what I did; feel free to comment and make suggestions. Thank you.

// Creates a temporary view using the DataFrame
df0.createOrReplaceTempView("GSPC")

// Compute the average closing price per year for GSPC
val results = spark.sql("SELECT YEAR(TO_DATE(dt, 'yyyy-MM-dd')) AS YEAR, avg(closeprice) FROM GSPC GROUP BY YEAR ORDER BY YEAR DESC")

results.show()
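(For reference, the same per-year aggregation can be sketched in plain Scala over a few sample rows, no Spark required. The `Quote` case class is an illustrative stand-in for the DataFrame rows; the 1950 values come from the table above, while the 1951 row is made up just to show a second group.)

```scala
import java.time.LocalDate

// Illustrative stand-in for one row of the GSPC DataFrame.
case class Quote(dt: String, closeprice: Double)

val rows = Seq(
  Quote("1950-01-03", 16.66),
  Quote("1950-01-04", 16.85),
  Quote("1951-01-02", 20.00) // hypothetical row for a second year
)

// Group by the year parsed out of dt, then average closeprice per group --
// the same logic as GROUP BY YEAR(TO_DATE(dt, 'yyyy-MM-dd')) with avg(closeprice).
val avgByYear: Map[Int, Double] =
  rows
    .groupBy(q => LocalDate.parse(q.dt).getYear)
    .map { case (year, qs) => year -> qs.map(_.closeprice).sum / qs.size }

println(avgByYear.toSeq.sortBy(-_._1)) // newest year first, as in ORDER BY YEAR DESC
```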