Get the first non-null value in a group

Time: 2017-08-11 19:03:01

Tags: apache-spark pyspark apache-spark-sql pyspark-sql

In Spark SQL, how can I get the first non-null value (or the first value matching some text, e.g. anything that is not 'N/A') within a group? In the example below a user is watching TV channels; the first 3 records belong to channel 100, where SIGNAL_STRENGHT starts out as N/A while the next record's value is Good, so that is the value I want to use.

I tried window functions, but all I can find are methods like MAX, MIN, and so on.

If I use lead I only get the next row, and if I use an unbounded frame I don't see any method like firstNotNull. Please advise.

Input:

CUSTOMER_ID || TV_CHANNEL_ID || TIME || SIGNAL_STRENGHT
1 || 100 || 0 || N/A
1 || 100 || 1 || Good
1 || 100 || 2 || Meduim
1 || 100 || 3 || N/A
1 || 100 || 4 || Poor
1 || 100 || 5 || Meduim
1 || 200 || 6 || N/A
1 || 200 || 7 || N/A
1 || 200 || 8 || Poor
1 || 300 || 9 || Good
1 || 300 || 10 || Good
1 || 300 || 11 || Good

Expected output:

CUSTOMER_ID || TV_CHANNEL_ID || TIME || SIGNAL_STRENGHT
1 || 100 || 0 || Good
1 || 100 || 1 || Good
1 || 100 || 2 || Meduim
1 || 100 || 3 || Poor
1 || 100 || 4 || Poor
1 || 100 || 5 || Meduim
1 || 200 || 6 || Poor
1 || 200 || 7 || Poor
1 || 200 || 8 || Poor
1 || 300 || 9 || Good
1 || 300 || 10 || Good
1 || 300 || 11 || Good

Actual code

    package com.ganesh.test;

    import org.apache.spark.SparkContext;
    import org.apache.spark.sql.*;
    import org.apache.spark.sql.expressions.Window;
    import org.apache.spark.sql.expressions.WindowSpec;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class ChannelLoader {

        private static final Logger LOGGER = LoggerFactory.getLogger(ChannelLoader.class);

        public static void main(String[] args) throws AnalysisException {
            String master = "local[*]";
            //region
            SparkSession sparkSession = SparkSession
                    .builder()
                    .appName(ChannelLoader.class.getName())
                    .master(master).getOrCreate();
            SparkContext context = sparkSession.sparkContext();
            context.setLogLevel("ERROR");

            SQLContext sqlCtx = sparkSession.sqlContext();

            Dataset<Row> rawDataset = sparkSession.read()
                    .format("com.databricks.spark.csv")
                    .option("delimiter", ",")
                    .option("header", "true")
                    .load("sample_channel.csv");

            rawDataset.printSchema();

            rawDataset.createOrReplaceTempView("channelView");
            //endregion

            // Leftover from an earlier attempt with the DataFrame window API; not used below.
            WindowSpec windowSpec = Window.partitionBy("CUSTOMER_ID").orderBy("TV_CHANNEL_ID");

            // This is the attempt that does not work: isNan(...) is not a valid window
            // expression, and it does not give me the first non-null value in the group.
            rawDataset = sqlCtx.sql("select * ," +
                    " ( isNan(SIGNAL_STRENGHT) over ( partition by CUSTOMER_ID, TV_CHANNEL_ID order by TIME ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING )  ) as updatedStren " +
                    " from channelView " +
                    " order by CUSTOMER_ID, TV_CHANNEL_ID, TIME "
            );

            rawDataset.show();

            sparkSession.close();

        }
    }
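
If there were something like a firstNotNull, I imagine the query would look roughly like the sketch below. This is only a guess on my part: it assumes that first(expr, true) (the ignore-nulls form of Spark SQL's first aggregate) can be used over a window frame, and that the N/A text has already been replaced with null. I have not been able to confirm that it behaves this way, and it still would not handle rows where every following value in the group is null; those would need a backward-looking pass as well.

    // Sketch only -- NOT verified. Assumes first(expr, true) skips nulls and
    // respects the forward-looking frame; column and view names are the same
    // as in the code above, with 'N/A' already converted to null.
    rawDataset = sqlCtx.sql("select CUSTOMER_ID, TV_CHANNEL_ID, TIME, " +
            " first(SIGNAL_STRENGHT, true) over ( partition by CUSTOMER_ID, TV_CHANNEL_ID " +
            "   order by TIME ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING ) as NEW_SIGNAL_STRENGHT " +
            " from channelView " +
            " order by CUSTOMER_ID, TV_CHANNEL_ID, TIME ");
    rawDataset.show();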

Update

I looked at many possible approaches but had no luck, so I brute-forced it and got the expected result: I compute a few helper columns and derive the answer from them. I decided to convert N/A to null first, so that it does not show up when I use collect_list.
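
The N/A-to-null conversion itself is not shown below. One way it could be done (this is just a sketch, not the exact statement I ran; it only relies on a plain CASE expression over the channelView registered above) is:

    // Sketch only: replace the literal text 'N/A' with null so that collect_list,
    // which skips nulls, ignores those rows. The view is re-registered under the same name.
    rawDataset = sqlCtx.sql("select CUSTOMER_ID, TV_CHANNEL_ID, TIME, " +
            " case when SIGNAL_STRENGTH = 'N/A' then null else SIGNAL_STRENGTH end as SIGNAL_STRENGTH " +
            " from channelView ");
    rawDataset.createOrReplaceTempView("channelView");

With that in place, the brute-force query is: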

    // Helper columns computed over the group window:
    //   fwdValues  - non-null SIGNAL_STRENGTH values from the current row to the end of the group
    //   bkwdValues - non-null SIGNAL_STRENGTH values from the start of the group up to the current row
    //   rank_fwd   - row position counted from the start of the group
    //   rank_bkwd  - row position counted from the end of the group
    rawDataset = sqlCtx.sql("select * " +
            " , ( collect_list(SIGNAL_STRENGTH) " +
            " over ( partition by CUSTOMER_ID, TV_CHANNEL_ID order by TIME ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING )  )" +
            " as fwdValues " +
            " , ( collect_list(SIGNAL_STRENGTH) " +
            " over ( partition by CUSTOMER_ID, TV_CHANNEL_ID order by TIME ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW )  )" +
            " as bkwdValues " +
            " , ( row_number() over ( partition by CUSTOMER_ID, TV_CHANNEL_ID order by TIME ) ) as rank_fwd " +
            " , ( row_number() over ( partition by CUSTOMER_ID, TV_CHANNEL_ID order by TIME DESC ) ) as rank_bkwd " +
            " from channelView " +
            " order by CUSTOMER_ID, TV_CHANNEL_ID, TIME "
    );
    rawDataset.show();
    rawDataset.createOrReplaceTempView("updatedChannelView");
    // Choose the replacement value:
    //   null on the last row of the group -> last non-null value seen so far (end of bkwdValues)
    //   null elsewhere                    -> first non-null value ahead (fwdValues[0])
    //   otherwise                         -> keep SIGNAL_STRENGTH as-is
    sqlCtx.sql("select * " +
            " , SIGNAL_STRENGTH " +
            ", ( case " +
            "   when (SIGNAL_STRENGTH IS NULL AND rank_bkwd = 1) then bkwdValues[size(bkwdValues)-1] " +
            "   when (SIGNAL_STRENGTH IS NULL ) then fwdValues[0] " +
            "   else SIGNAL_STRENGTH " +
            "  end ) as NEW_SIGNAL_STRENGTH" +
            " from updatedChannelView " +
            ""
    ).show();

Code output

    +-----------+-------------+----+---------------+--------------------+--------------------+--------+---------+---------------+-------------------+
    |CUSTOMER_ID|TV_CHANNEL_ID|TIME|SIGNAL_STRENGTH|           fwdValues|          bkwdValues|rank_fwd|rank_bkwd|SIGNAL_STRENGTH|NEW_SIGNAL_STRENGTH|
    +-----------+-------------+----+---------------+--------------------+--------------------+--------+---------+---------------+-------------------+
    |          1|          100|   0|           null|[Good, Meduim, Poor]|                  []|       1|        6|           null|               Good|
    |          1|          100|   1|           Good|[Good, Meduim, Poor]|              [Good]|       2|        5|           Good|               Good|
    |          1|          100|   2|         Meduim|      [Meduim, Poor]|      [Good, Meduim]|       3|        4|         Meduim|             Meduim|
    |          1|          100|   3|           null|              [Poor]|      [Good, Meduim]|       4|        3|           null|               Poor|
    |          1|          100|   4|           Poor|              [Poor]|[Good, Meduim, Poor]|       5|        2|           Poor|               Poor|
    |          1|          100|   5|           null|                  []|[Good, Meduim, Poor]|       6|        1|           null|               Poor|
    |          1|          200|   6|           null|              [Poor]|                  []|       1|        3|           null|               Poor|
    |          1|          200|   7|           null|              [Poor]|                  []|       2|        2|           null|               Poor|
    |          1|          200|   8|           Poor|              [Poor]|              [Poor]|       3|        1|           Poor|               Poor|
    |          1|          300|  10|           null|              [Good]|                  []|       1|        3|           null|               Good|
    |          1|          300|  11|           null|              [Good]|                  []|       2|        2|           null|               Good|
    |          1|          300|   9|           Good|              [Good]|              [Good]|       3|        1|           Good|               Good|
    +-----------+-------------+----+---------------+--------------------+--------------------+--------+---------+---------------+-------------------+

0 Answers:

No answers