spark-sql - Filtering data using nested queries

Time: 2017-01-17 11:17:38

Tags: java apache-spark apache-spark-sql apache-spark-2.0 apache-spark-dataset

I have a huge .csv file with many columns, but the ones that matter to me are USER_ID (user identifier), DURATION (duration of call), TYPE (incoming or outgoing), DATE, and NUMBER (mobile number).

What I want to do is replace every null value in the DURATION column with the average duration of all calls of the same type made by the same user (i.e. with the same USER_ID).
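To make the intended result concrete, here is a minimal plain-Java sketch of the same replacement on an in-memory list, independent of Spark. The `Call` class, the `fillNullDurations` method, and the sample rows are hypothetical names and values introduced only for illustration. Groups are keyed by user and type only, as described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupMeanImputation {

    /** One call-log row (hypothetical type); duration is null when the value is missing. */
    static final class Call {
        final String userId;
        final String type;
        final Double duration;

        Call(String userId, String type, Double duration) {
            this.userId = userId;
            this.type = type;
            this.duration = duration;
        }
    }

    /**
     * Returns a copy of the rows in which each null duration is replaced by the
     * mean of the non-null durations that share the same (userId, type) pair.
     */
    static List<Call> fillNullDurations(List<Call> calls) {
        // First pass: accumulate {sum, count} of known durations per group.
        // The "|" separator assumes user ids never contain that character.
        Map<String, double[]> sumAndCount = new HashMap<>();
        for (Call c : calls) {
            if (c.duration != null) {
                double[] acc = sumAndCount.computeIfAbsent(c.userId + "|" + c.type, k -> new double[2]);
                acc[0] += c.duration;
                acc[1] += 1;
            }
        }
        // Second pass: fill the gaps from the group statistics.
        List<Call> filled = new ArrayList<>();
        for (Call c : calls) {
            Double d = c.duration;
            if (d == null) {
                double[] acc = sumAndCount.get(c.userId + "|" + c.type);
                // The value stays null when the group has no known duration at all.
                d = (acc == null) ? null : Double.valueOf(acc[0] / acc[1]);
            }
            filled.add(new Call(c.userId, c.type, d));
        }
        return filled;
    }

    public static void main(String[] args) {
        List<Call> calls = Arrays.asList(
                new Call("u1", "OUTGOING", 20.0),
                new Call("u1", "OUTGOING", null),
                new Call("u1", "OUTGOING", 40.0),
                new Call("u1", "INCOMING", 10.0),
                new Call("u2", "OUTGOING", null));
        for (Call c : fillNullDurations(calls)) {
            System.out.println(c.userId + " " + c.type + " " + c.duration);
        }
    }
}
```

In this sample the missing OUTGOING duration for `u1` becomes 30.0, the mean of 20.0 and 40.0, while the `u2` row stays null because its group has no known value. In Spark, a similar effect could plausibly be expressed as `coalesce(col(DURATION), avg(col(DURATION)).over(Window.partitionBy(col(USER_ID), col(TYPE))))`. That expression is an untested suggestion and has not been checked against the data in this question.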

I compute the averages as follows. The query below finds the average duration of all calls of the same type for the same user.

Dataset<Row> filteredData = callLogsDataSet.selectExpr(USER_ID, DURATION, TYPE, DATE, NORMALIZE_NUMBER)
      /*1*/ .filter(col(USER_ID).isNotNull().and(col(TYPE).isNotNull()).and(col(NORMALIZE_NUMBER).isNotNull()).and(col(DATE).gt(0)).and(col(TYPE).isin("OUTGOING","INCOMING")))
      /*2*/ .groupBy(col(USER_ID), col(TYPE), col(NORMALIZE_NUMBER))
      /*3*/ .agg(sum(DURATION).alias(DURATION_IN_MIN).divide(count(col(USER_ID))));

filteredData.show() gives:

|USER_ID                         |type    |normalized_number|(sum(duration) AS `durationInMin` / count(USER_ID))|
+--------------------------------+--------+-----------------+---------------------------------------------------+
|8a8a8a8a592b4ace01595e65901b0013|OUTGOING|+435657456354    |0.0                                                |
|8a8a8a8a592b4ace01595e70dcbd0016|OUTGOING|+876454354353    |48.6                                               |
|8a8a8a8a592b4ace01595e099764000c|INCOMING|+132445686765    |15.0                                               |
|8a8a8a8a592b4ace01592b4ff4b90000|INCOMING|+097645634324    |74.16666666666667                                  |
|8a8a8a8a592b4ace0159366a56290005|INCOMING|+134435657656    |15.0                                               |
|8a8a8a8a592b4ace01595e70dcbd0016|OUTGOING|+135879878543    |31.0                                               |
|8a8a8a8a592b4ace0159366a56290005|INCOMING|+768435245243    |11.0                                               |
|8a8a8a8a592b4ace01592cd8fd160003|INCOMING|+787685534523    |0.0                                                |
|8a8a8a8a592b4ace01595e65901b0013|OUTGOING|+098976865745    |61.5                                               |
|8a8a8a8a592b4ace01592b4ff4b90000|OUTGOING|+123456787644    |43.333333333333336                                 |
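One detail of the aggregation above may matter for the goal of filling in null durations. In SQL semantics `sum` skips nulls, while `count(USER_ID)` counts every row whose USER_ID is non-null, so dividing one by the other is not the same as averaging only the known durations. The plain-Java sketch below illustrates the arithmetic difference on three hypothetical values; the class and method names are invented for this example.

```java
import java.util.Arrays;
import java.util.List;

public class MeanWithNulls {

    /** Sum of the non-null values divided by the number of all rows, null rows included. */
    static double sumOverRowCount(List<Double> durations) {
        double sum = 0;
        for (Double d : durations) {
            if (d != null) {
                sum += d;
            }
        }
        return sum / durations.size();
    }

    /** Sum of the non-null values divided by the number of non-null values. */
    static double meanOfKnown(List<Double> durations) {
        double sum = 0;
        int known = 0;
        for (Double d : durations) {
            if (d != null) {
                sum += d;
                known++;
            }
        }
        // Yields NaN when no value is known, since 0.0 / 0 is NaN for doubles.
        return sum / known;
    }

    public static void main(String[] args) {
        List<Double> durations = Arrays.asList(20.0, null, 40.0);
        System.out.println(sumOverRowCount(durations)); // 60 / 3 rows
        System.out.println(meanOfKnown(durations));     // 60 / 2 known values
    }
}
```

With the values 20, null, and 40, the first method gives 20.0 and the second gives 30.0. If the second behaviour is what is wanted, `avg(DURATION)` may be closer than `sum(DURATION)` divided by `count(USER_ID)`. The long column header in the output also suggests that the alias is attached to the sum only, so applying `.alias(...)` to the whole expression would likely give the column a plain name.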

In the query below, I filter the data and, in step 2, replace all null occurrences with 0.

    Dataset<Row> filteredData2 = callLogsDataSet.selectExpr(USER_ID, DURATION, TYPE, DATE, NORMALIZE_NUMBER)
        /*1*/ .filter(col(USER_ID).isNotNull().and(col(TYPE).isNotNull()).and(col(NORMALIZE_NUMBER).isNotNull())
                    .and(col(DATE).gt(0)).and(col(DURATION).gt(0)).and(col(TYPE).isin("OUTGOING","INCOMING")))
        /*2*/ .withColumn(DURATION, when(col(DURATION).isNull(), 0).otherwise(col(DURATION).cast(LONG)))
        /*3*/ .withColumn(DATE, col(DATE).cast(LONG).minus(col(DATE).cast(LONG).mod(ROUND_ONE_MIN)).cast(LONG))
        /*4*/ .groupBy(col(USER_ID), col(DURATION), col(TYPE), col(DATE), col(NORMALIZE_NUMBER))
        /*5*/ .agg(sum(DURATION).alias(DURATION_IN_MIN))
        /*6*/ .withColumn(DAY_TIME, lit(""))
        /*7*/ .withColumn(WEEK_DAY, lit(""))
        /*8*/ .withColumn(HOUR_OF_DAY, lit(0));
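Step 3 of the query above subtracts `DATE mod ROUND_ONE_MIN` from `DATE`, which rounds each timestamp down to a multiple of `ROUND_ONE_MIN`. The small sketch below shows that arithmetic in plain Java. It assumes that `DATE` holds epoch milliseconds and that `ROUND_ONE_MIN` is 60000; neither is stated in the question, although the `date` values in the output are all multiples of 60000. The sample timestamp is made up.

```java
public class RoundToMinute {

    // Assumption: one minute expressed in milliseconds; the question does not show this constant.
    static final long ROUND_ONE_MIN = 60_000L;

    /** Rounds a non-negative epoch-millisecond timestamp down to the start of its minute. */
    static long floorToMinute(long epochMillis) {
        // Same arithmetic as step 3: value minus (value mod one minute).
        return epochMillis - epochMillis % ROUND_ONE_MIN;
    }

    public static void main(String[] args) {
        long sample = 1479017225123L; // hypothetical raw timestamp
        System.out.println(floorToMinute(sample));
    }
}
```

For the sample value the remainder is 5123 ms, so the result is 1479017220000. For negative timestamps Java's `%` keeps the sign of the dividend, so this expression would round toward zero rather than down; `Math.floorMod` avoids that if it ever matters.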

filteredData2.show() gives:

|USER_ID                         |duration|type    |date         |normalized_number|durationInMin|DAY_TIME|WEEK_DAY|HourOfDay|
+--------------------------------+--------+--------+-------------+-----------------+-------------+--------+--------+---------+
|8a8a8a8a592b4ace01595e70dcbd0016|25      |INCOMING|1479017220000|+465435534353    |25           |        |        |0        |
|8a8a8a8a592b4ace01595e099764000c|29      |INCOMING|1482562560000|+545765765775    |29           |        |        |0        |
|8a8a8a8a592b4ace01595e099764000c|75      |OUTGOING|1483363980000|+124435665755    |75           |        |        |0        |
|8a8a8a8a592b4ace01595e70dcbd0016|34      |OUTGOING|1483261920000|+098865563645    |34           |        |        |0        |
|8a8a8a8a592b4ace01595e70dcbd0016|22      |OUTGOING|1481712180000|+232434656765    |22           |        |        |0        |
|8a8a8a8a592b4ace0159366a56290005|64      |OUTGOING|1482984060000|+875634521325    |64           |        |        |0        |
|8a8a8a8a592b4ace0159366a56290005|179     |OUTGOING|1482825060000|+876542543554    |179          |        |        |0        |
|8a8a8a8a592b4ace01595e65901b0013|12      |OUTGOING|1482393360000|+098634563456    |12           |        |        |0        |
|8a8a8a8a592b4ace01595e70dcbd0016|14      |OUTGOING|1482820860000|+1344365i8787    |14           |        |        |0        |
|8a8a8a8a592b4ace01592b4ff4b90000|105     |INCOMING|1478772240000|+234326886784    |105          |        |        |0        |
|8a8a8a8a592b4ace01592b4ff4b90000|453     |OUTGOING|1480944480000|+134435676578    |453          |        |        |0        |
|8a8a8a8a592b4ace01595e099764000c|42      |OUTGOING|1483193100000|+413247687686    |42           |        |        |0        |
|8a8a8a8a592b4ace01595e099764000c|41      |OUTGOING|1481696820000|+134345435645    |41           |        |        |0        |

Please help me combine these two queries, or use them to get the desired result. I am new to Spark and Spark SQL.

Thanks.

0 answers:

No answers