How can I compute the average closing price per year in Spark?
sqlContext.sql(
  "select avg(closeprice), year(dt) from df0 group by year(dt)"
).show
The difficulty is that the "dt" column is StringType, and I cannot cast a StringType column to year/month/day. Please comment. The schema is as follows:
import org.apache.spark.sql.types._

val customSchema = StructType(Array(
StructField("dt", StringType, true),
StructField("openprice", DoubleType, true),
StructField("highprice", DoubleType, true),
StructField("lowprice", DoubleType, true),
StructField("closeprice", DoubleType, true),
StructField("volume", IntegerType, true)
))
val df0 = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", ",")
  .option("header", "true")
  .schema(customSchema)
  .load("./data/GSPC.csv")
df0.show()
df0.printSchema()
+----------+---------+---------+---------+----------+-------+
| dt|openprice|highprice| lowprice|closeprice| volume|
+----------+---------+---------+---------+----------+-------+
|1950-01-03| 16.66| 16.66| 16.66| 16.66|1260000|
|1950-01-04| 16.85| 16.85| 16.85| 16.85|1890000|
|1950-01-05| 16.93| 16.93| 16.93| 16.93|2550000|
|1950-01-06| 16.98| 16.98| 16.98| 16.98|2010000|
|1950-01-09| 17.09| 17.09| 17.08| 17.08|3850000|
|1950-01-10|17.030001|17.030001|17.030001| 17.030001|2160000|
|1950-01-11| 17.09| 17.09| 17.09| 17.09|2630000|
|1950-01-12| 16.76| 16.76| 16.76| 16.76|2970000|
|1950-01-13| 16.67| 16.67| 16.67| 16.67|3330000|
|1950-01-16| 16.65|16.719999| 16.65| 16.719999|2640000|
|1950-01-17|16.860001|16.860001|16.860001| 16.860001|1790000|
|1950-01-18| 16.85| 16.85| 16.85| 16.85|1570000|
|1950-01-19|16.870001|16.870001|16.870001| 16.870001|1170000|
|1950-01-20| 16.9| 16.9| 16.9| 16.9|1440000|
|1950-01-23|16.940001|16.940001| 16.92| 16.92|1890000|
|1950-01-24|16.860001|16.860001|16.860001| 16.860001|1250000|
|1950-01-25| 16.74| 16.74| 16.74| 16.74|1700000|
|1950-01-26| 16.73| 16.73| 16.73| 16.73|1150000|
|1950-01-27| 16.82| 16.82| 16.82| 16.82|1250000|
|1950-01-30| 16.9| 17.02| 16.9| 17.02|2380000|
+----------+---------+---------+---------+----------+-------+
root
|-- dt: string (nullable = true)
|-- openprice: double (nullable = true)
|-- highprice: double (nullable = true)
|-- lowprice: double (nullable = true)
|-- closeprice: double (nullable = true)
|-- volume: integer (nullable = true)
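As an aside: on Spark 2.x and later the CSV source is built in, so the external com.databricks.spark.csv package is unnecessary and the entry point is spark rather than sqlContext. A minimal sketch, assuming the same file and schema:

// Spark 2.x+: built-in CSV reader, no com.databricks.spark.csv needed
val df0 = spark.read
  .option("delimiter", ",")
  .option("header", "true")
  .schema(customSchema)
  .csv("./data/GSPC.csv")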
Answer 0 (score: 1)
Recent versions of Spark use to_date:
scala> spark.sql("SELECT YEAR(TO_DATE('3-1-1959', 'dd-MM-yyyy'))").show
+---------------------------------------+
|year(to_date('3-1-1959', 'dd-MM-yyyy'))|
+---------------------------------------+
| 1959|
+---------------------------------------+
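The same works through the DataFrame API; a minimal sketch against the question's df0, assuming Spark 2.2+ for the two-argument to_date:

import org.apache.spark.sql.functions.{avg, col, to_date, year}

// Parse the string column, extract the year, then aggregate per year
df0.withColumn("yr", year(to_date(col("dt"), "yyyy-MM-dd")))
  .groupBy("yr")
  .agg(avg("closeprice").as("avg_close"))
  .orderBy("yr")
  .show()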
Older versions:
scala> spark.sql("SELECT YEAR(CAST(UNIX_TIMESTAMP('3-1-1959', 'dd-MM-yyyy') AS TIMESTAMP))").show
+---------------------------------------------------------------------------+
|year(CAST(CAST(unix_timestamp(3-1-1959, dd-MM-yyyy) AS TIMESTAMP) AS DATE))|
+---------------------------------------------------------------------------+
| 1959|
+---------------------------------------------------------------------------+
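Applied to the question's data, where dt is already in 'yyyy-MM-dd' form, the older-version pattern would look like this (a sketch; it assumes df0 has been registered as a temporary table, e.g. via df0.registerTempTable("df0") on Spark 1.x):

sqlContext.sql("""
  SELECT YEAR(CAST(UNIX_TIMESTAMP(dt, 'yyyy-MM-dd') AS TIMESTAMP)) AS yr, AVG(closeprice)
  FROM df0
  GROUP BY YEAR(CAST(UNIX_TIMESTAMP(dt, 'yyyy-MM-dd') AS TIMESTAMP))
""").show()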
Answer 1 (score: 0)
Thanks for the help. I prefer plain SQL, since I'm an Oracle DBA and write a lot of SQL. Anyway, here is an answer describing what I did; feel free to comment and suggest improvements. Thanks.
// Creates a temporary view using the DataFrame
df0.createOrReplaceTempView("GSPC")
// Compute the average closing price per year for GSPC
val results = spark.sql("SELECT YEAR(TO_DATE(dt, 'yyyy-MM-dd')) AS YEAR, avg(closeprice) FROM GSPC GROUP BY YEAR ORDER BY YEAR DESC")
results.show()
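One caveat: GROUP BY YEAR groups by a select-list alias, which is a Spark extension rather than standard SQL. If your Spark version rejects it, a sketch of the equivalent query that repeats the expression instead:

val results = spark.sql("""
  SELECT YEAR(TO_DATE(dt, 'yyyy-MM-dd')) AS year, AVG(closeprice) AS avg_close
  FROM GSPC
  GROUP BY YEAR(TO_DATE(dt, 'yyyy-MM-dd'))
  ORDER BY year DESC
""")
results.show()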