I have a huge .csv file with many columns, but the ones that matter to me are USER_ID (user identifier), DURATION (duration of the call), TYPE (incoming or outgoing), DATE, and NUMBER (mobile number).
What I want to do is replace every null value in the DURATION column with the average duration of all calls of the same type made by the same user (i.e. with the same USER_ID).
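
To pin down the result I am after, here is a minimal plain-Java sketch of the replacement, independent of Spark. The class `DurationImputer`, the `Call` type, and the sample rows are hypothetical and only illustrate the arithmetic; they are not part of my actual job.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DurationImputer {
    /** One call-log row (hypothetical type); duration is null when the value is missing. */
    static final class Call {
        final String userId;
        final String type;
        final Double duration;

        Call(String userId, String type, Double duration) {
            this.userId = userId;
            this.type = type;
            this.duration = duration;
        }
    }

    /**
     * Returns a copy of the rows in which each null duration is replaced by the
     * mean of the non-null durations that share the same (userId, type).
     * A group with no known duration keeps its nulls.
     */
    static List<Call> fillMissing(List<Call> rows) {
        // key "userId|type" -> {sum of known durations, number of known durations}
        Map<String, double[]> stats = new HashMap<>();
        for (Call c : rows) {
            if (c.duration != null) {
                double[] s = stats.computeIfAbsent(c.userId + "|" + c.type, k -> new double[2]);
                s[0] += c.duration;
                s[1] += 1;
            }
        }
        List<Call> out = new ArrayList<>();
        for (Call c : rows) {
            Double d = c.duration;
            if (d == null) {
                double[] s = stats.get(c.userId + "|" + c.type);
                if (s != null) {
                    d = s[0] / s[1];
                }
            }
            out.add(new Call(c.userId, c.type, d));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Call> rows = new ArrayList<>();
        rows.add(new Call("u1", "INCOMING", 10.0));
        rows.add(new Call("u1", "INCOMING", 20.0));
        rows.add(new Call("u1", "INCOMING", null)); // should become the group mean
        rows.add(new Call("u1", "OUTGOING", null)); // no known value in this group
        for (Call c : fillMissing(rows)) {
            System.out.println(c.userId + " " + c.type + " " + c.duration);
        }
    }
}
```

In this sketch the mean is taken over the known durations only, so rows with a null duration do not pull the average down.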
I computed the average as follows.
In the query below, I find the average duration of all calls of the same type for the same user.
Dataset<Row> filteredData = callLogsDataSet.selectExpr(USER_ID, DURATION, TYPE, DATE, NORMALIZE_NUMBER)
/*1*/ .filter(col(USER_ID).isNotNull().and(col(TYPE).isNotNull()).and(col(NORMALIZE_NUMBER).isNotNull()).and(col(DATE).gt(0)).and(col(TYPE).isin("OUTGOING","INCOMING")))
/*2*/ .groupBy(col(USER_ID), col(TYPE), col(NORMALIZE_NUMBER))
/*3*/ .agg(sum(DURATION).alias(DURATION_IN_MIN).divide(count(col(USER_ID))));
filteredData.show() gives:
|USER_ID |type |normalized_number|(sum(duration) AS `durationInMin` / count(USER_ID))|
+--------------------------------+--------+-----------------+---------------------------------------------------+
|8a8a8a8a592b4ace01595e65901b0013|OUTGOING|+435657456354 |0.0 |
|8a8a8a8a592b4ace01595e70dcbd0016|OUTGOING|+876454354353 |48.6 |
|8a8a8a8a592b4ace01595e099764000c|INCOMING|+132445686765 |15.0 |
|8a8a8a8a592b4ace01592b4ff4b90000|INCOMING|+097645634324 |74.16666666666667 |
|8a8a8a8a592b4ace0159366a56290005|INCOMING|+134435657656 |15.0 |
|8a8a8a8a592b4ace01595e70dcbd0016|OUTGOING|+135879878543 |31.0 |
|8a8a8a8a592b4ace0159366a56290005|INCOMING|+768435245243 |11.0 |
|8a8a8a8a592b4ace01592cd8fd160003|INCOMING|+787685534523 |0.0 |
|8a8a8a8a592b4ace01595e65901b0013|OUTGOING|+098976865745 |61.5 |
|8a8a8a8a592b4ace01592b4ff4b90000|OUTGOING|+123456787644 |43.333333333333336 |
In the query below, I filter the data and, in step 2, replace every occurrence of null with 0.
Dataset<Row> filteredData2 = callLogsDataSet.selectExpr(USER_ID, DURATION, TYPE, DATE, NORMALIZE_NUMBER)
/*1*/ .filter(col(USER_ID).isNotNull().and(col(TYPE).isNotNull()).and(col(NORMALIZE_NUMBER).isNotNull())
.and(col(DATE).gt(0)).and(col(DURATION).gt(0)).and(col(TYPE).isin("OUTGOING","INCOMING")))
/*2*/ .withColumn(DURATION, when(col(DURATION).isNull(), 0).otherwise(col(DURATION).cast(LONG)))
/*3*/ .withColumn(DATE, col(DATE).cast(LONG).minus(col(DATE).cast(LONG).mod(ROUND_ONE_MIN)).cast(LONG))
/*4*/ .groupBy(col(USER_ID), col(DURATION), col(TYPE), col(DATE), col(NORMALIZE_NUMBER))
/*5*/ .agg(sum(DURATION).alias(DURATION_IN_MIN))
/*6*/ .withColumn(DAY_TIME, lit(""))
/*7*/ .withColumn(WEEK_DAY, lit(""))
/*8*/ .withColumn(HOUR_OF_DAY, lit(0));
filteredData2.show() gives:
|USER_ID |duration|type |date |normalized_number|durationInMin|DAY_TIME|WEEK_DAY|HourOfDay|
+--------------------------------+--------+--------+-------------+-----------------+-------------+--------+--------+---------+
|8a8a8a8a592b4ace01595e70dcbd0016|25 |INCOMING|1479017220000|+465435534353 |25 | | |0 |
|8a8a8a8a592b4ace01595e099764000c|29 |INCOMING|1482562560000|+545765765775 |29 | | |0 |
|8a8a8a8a592b4ace01595e099764000c|75 |OUTGOING|1483363980000|+124435665755 |75 | | |0 |
|8a8a8a8a592b4ace01595e70dcbd0016|34 |OUTGOING|1483261920000|+098865563645 |34 | | |0 |
|8a8a8a8a592b4ace01595e70dcbd0016|22 |OUTGOING|1481712180000|+232434656765 |22 | | |0 |
|8a8a8a8a592b4ace0159366a56290005|64 |OUTGOING|1482984060000|+875634521325 |64 | | |0 |
|8a8a8a8a592b4ace0159366a56290005|179 |OUTGOING|1482825060000|+876542543554 |179 | | |0 |
|8a8a8a8a592b4ace01595e65901b0013|12 |OUTGOING|1482393360000|+098634563456 |12 | | |0 |
|8a8a8a8a592b4ace01595e70dcbd0016|14 |OUTGOING|1482820860000|+1344365i8787 |14 | | |0 |
|8a8a8a8a592b4ace01592b4ff4b90000|105 |INCOMING|1478772240000|+234326886784 |105 | | |0 |
|8a8a8a8a592b4ace01592b4ff4b90000|453 |OUTGOING|1480944480000|+134435676578 |453 | | |0 |
|8a8a8a8a592b4ace01595e099764000c|42 |OUTGOING|1483193100000|+413247687686 |42 | | |0 |
|8a8a8a8a592b4ace01595e099764000c|41 |OUTGOING|1481696820000|+134345435645 |41 | | |0 |
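
For reference, step 3 of the second query rounds each DATE down to the start of its minute. A minimal plain-Java sketch of that arithmetic follows, assuming DATE is in epoch milliseconds and ROUND_ONE_MIN is 60000 (its value is not shown above). The class and method names are hypothetical.

```java
class MinuteRounding {
    // Assumption: ROUND_ONE_MIN is one minute expressed in milliseconds.
    static final long ROUND_ONE_MIN = 60_000L;

    /**
     * Rounds a non-negative epoch-millisecond timestamp down to the start of
     * its minute, mirroring date - (date % ROUND_ONE_MIN) from step 3.
     */
    static long floorToMinute(long epochMillis) {
        return epochMillis - (epochMillis % ROUND_ONE_MIN);
    }

    public static void main(String[] args) {
        // 25.123 seconds past the minute is dropped.
        System.out.println(floorToMinute(1479017245123L)); // prints 1479017220000
    }
}
```

The DATE values in the output above are all multiples of 60000, which is consistent with this assumption.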
Please help me combine these two queries, or use them to get the desired result. I am new to Spark and Spark SQL.
Thanks.