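Spark SQL: Getting the month from week number and year

Time: 2016-05-30 12:47:18

Tags: apache-spark apache-spark-sql

I have a DataFrame with "Week" and "Year" columns and need to compute the month, as shown below:

Input: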

+----+----+
|Week|Year|
+----+----+
|  50|2012|
|  50|2012|
|  50|2012|
+----+----+

Expected output:

+----+----+-----+
|Week|Year|Month|
+----+----+-----+
|  50|2012|12   |
|  50|2012|12   |
|  50|2012|12   |
+----+----+-----+

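Any help would be greatly appreciated. Thanks.

1 answer:

Answer 0: (score: 1)

Thanks to @zero323, who pointed me to the sqlContext.sql query. I converted that query to the DataFrame API as shown below: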

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import static org.apache.spark.sql.functions.*;

public class MonthFromWeekSparkSQL {

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setAppName("MonthFromWeekSparkSQL").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Sample rows of (week, year), typed as Row rather than raw List/JavaRDD
        List<Row> myList = Arrays.asList(RowFactory.create(50, 2012), RowFactory.create(50, 2012), RowFactory.create(50, 2012));
        JavaRDD<Row> myRDD = sc.parallelize(myList);

        List<StructField> structFields = new ArrayList<StructField>();

        // Create StructFields
        StructField structField1 = DataTypes.createStructField("week", DataTypes.IntegerType, true);
        StructField structField2 = DataTypes.createStructField("year", DataTypes.IntegerType, true);

        // Add StructFields into list
        structFields.add(structField1);
        structFields.add(structField2);

        // Create StructType from StructFields. This will be used to create DataFrame
        StructType schema = DataTypes.createStructType(structFields);

        DataFrame df = sqlContext.createDataFrame(myRDD, schema);
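        // Build a temporary "year week" string, parse it with the "yyyy w" pattern,
        // extract the month, then drop the temporary column
        DataFrame df2 = df.withColumn("yearAndWeek", concat(col("year"), lit(" "), col("week")))
                .withColumn("month", month(unix_timestamp(col("yearAndWeek"), "yyyy w").cast("timestamp")))
                .drop("yearAndWeek");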

        df2.show();

    }

}

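In effect, you create a new column that holds the year and week in the format "yyyy w", convert it with unix_timestamp, and then extract the month from the resulting timestamp.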
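
To see what the "yyyy w" pattern does without starting Spark, here is a minimal plain-Java sketch of the same parsing step. It is not the Spark code path itself: it assumes the pattern is interpreted with SimpleDateFormat-style semantics, and it pins Locale.US and UTC so the week numbering is deterministic. The class and method names (WeekToMonthSketch, monthFromYearAndWeek) are made up for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Locale;
import java.util.TimeZone;

class WeekToMonthSketch {

    // Hypothetical helper: parses "<year> <week>" with the "yyyy w" pattern
    // and returns the 1-based month of the date the parser resolves to.
    // Assumption: Locale.US week rules (weeks start on Sunday, week 1 contains Jan 1).
    static int monthFromYearAndWeek(int year, int week) {
        TimeZone utc = TimeZone.getTimeZone("UTC");
        SimpleDateFormat format = new SimpleDateFormat("yyyy w", Locale.US);
        format.setTimeZone(utc);
        try {
            Calendar calendar = Calendar.getInstance(utc, Locale.US);
            calendar.setTime(format.parse(year + " " + week));
            // Calendar.MONTH is 0-based, so add 1 to get 1..12
            return calendar.get(Calendar.MONTH) + 1;
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse year/week: " + year + " " + week, e);
        }
    }

    public static void main(String[] args) {
        // Week 50 of 2012 falls in December
        System.out.println(monthFromYearAndWeek(2012, 50)); // prints 12
    }
}
```

Note that the parser resolves a week to its first day, so a week that straddles two months is assigned to the month in which it starts, and week numbering depends on the locale in effect.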

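PS: The cast behavior appears to be incorrect in Spark 1.5; see https://issues.apache.org/jira/browse/SPARK-11724

So in that case, the more general approach is to use .cast("double").cast("timestamp") instead of .cast("timestamp").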