Implementing a SQL window function with unbounded following using Scala RDDs

Asked: 2018-03-15 19:31:27

Tags: scala apache-spark dataframe spark-dataframe rdd

I am new to Scala and would like to implement the following using Scala RDDs. Please help me out.

INPUT

primary_key ip_address  unixtimestamp   user_id
1            abc         1421140627       x
2            abc         1421140628       
3            abc         1421140629       y
4            abc         1421140630       z
5            xyz         1421140233       k
6            xyz         1421140234       
7            xyz         1421140235       
8            xyz         1421140236       y
9            xyz         1421140237       n
10           noi         1421140112       f
12           noi         1421140113       
13           noi         1421140114       g
14           noi         1421140115       
15           noi         1421140116       h
16           noi         1421140117 
17           noi         1421140118 

OUTPUT

primary_key ip_address  unixtimestamp   user_id
1            abc         1421140627      x
2            abc         1421140628      y
3            abc         1421140629      y
4            abc         1421140630      z
5            xyz         1421140233      k
6            xyz         1421140234      y
7            xyz         1421140235      y
8            xyz         1421140236      y
9            xyz         1421140237      n
10           noi         1421140112      f
12           noi         1421140113      g
13           noi         1421140114      g
14           noi         1421140115      h
15           noi         1421140116      h
16           noi         1421140117 
17           noi         1421140118 

Basically, for each ip_address group, I want to backfill user_id wherever it is null, using the next non-null user_id in timestamp order. I have implemented this successfully with Spark DataFrames for small data sizes, but when the number of rows in a partition (an ip_address in this case) is large (> 10 million), the job never finishes. To give you an idea of the data size: the total row count is about 200 million, and the maximum number of rows in a single partition (i.e., for one ip_address) is about 15 million.

Can someone help me implement this using Scala RDDs? Thanks in advance.

As requested, please find my DataFrame solution below.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Window over each ip_address, ordered by timestamp, spanning all following rows
val partitionWindowWithUnboundedFollowing = Window.partitionBy("ip_address")
  .orderBy("unixtimestamp")
  .rowsBetween(1, Long.MaxValue)

val input = hc.table("my_data")

// Backfill candidate: the minimum "unixtimestamp-user_id" string among the following rows
val useridIdDerv = input.withColumn("USER_ID_FILLED",
  min(concat(trim(col("unixtimestamp")), lit("-"), trim(col("user_id"))))
    .over(partitionWindowWithUnboundedFollowing))

After these two steps, I apply a substring function on USER_ID_FILLED and then do a SQL coalesce of user_id and USER_ID_FILLED (derived in the step above).
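For reference, a minimal sketch of what that final step might look like, given the columns and imports above; substring_index is used here as a stand-in for the substring call, and it assumes user_id itself contains no hyphen:

// Strip the "unixtimestamp-" prefix from USER_ID_FILLED and fall back to it
// wherever user_id is null; rows with no following non-null value stay null.
val backfilled = useridIdDerv.withColumn("user_id",
  coalesce(col("user_id"), substring_index(col("USER_ID_FILLED"), "-", -1)))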

1 Answer:

Answer 0 (score: 0)

Not sure whether this will reduce the execution time significantly, but I think backfilling user_id can be simplified by using first() with ignoreNulls, as follows:

import spark.implicits._  // needed for toDF (already in scope in spark-shell)

val df = Seq(
  (1, "abc", 1421140627, "x"),
  (2, "abc", 1421140628, null),
  (3, "abc", 1421140629, "y"),
  (4, "abc", 1421140630, "z"),
  (5, "xyz", 1421140633, "k"),
  (6, "xyz", 1421140634, null),
  (7, "xyz", 1421140635, null),
  (8, "xyz", 1421140636, "y"),
  (9, "xyz", 1421140637, "n"),
  (10, "noi", 1421140112, "f"),
  (12, "noi", 1421140113, null),
  (13, "noi", 1421140114, "g"),
  (14, "noi", 1421140115, null),
  (15, "noi", 1421140116, "h"),
  (16, "noi", 1421140117, null),
  (17, "noi", 1421140118, null)
).toDF("primary_key", "ip_address", "unixtimestamp", "user_id")

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val windowSpec = Window.partitionBy("ip_address").orderBy("unixtimestamp").
  rowsBetween(0, Window.unboundedFollowing)

df.withColumn("user_id", first("user_id", ignoreNulls=true).over(windowSpec)).
  orderBy("primary_key").
  show

// +-----------+----------+-------------+-------+
// |primary_key|ip_address|unixtimestamp|user_id|
// +-----------+----------+-------------+-------+
// |          1|       abc|   1421140627|      x|
// |          2|       abc|   1421140628|      y|
// |          3|       abc|   1421140629|      y|
// |          4|       abc|   1421140630|      z|
// |          5|       xyz|   1421140633|      k|
// |          6|       xyz|   1421140634|      y|
// |          7|       xyz|   1421140635|      y|
// |          8|       xyz|   1421140636|      y|
// |          9|       xyz|   1421140637|      n|
// |         10|       noi|   1421140112|      f|
// |         12|       noi|   1421140113|      g|
// |         13|       noi|   1421140114|      g|
// |         14|       noi|   1421140115|      h|
// |         15|       noi|   1421140116|      h|
// |         16|       noi|   1421140117|   null|
// |         17|       noi|   1421140118|   null|
// +-----------+----------+-------------+-------+
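As a side note, since the question explicitly asks for an RDD-based approach, below is a minimal sketch of how the backfill could be expressed directly on an RDD; the Record case class and the backfill helper are illustrative names, not from the original post. Each ip_address group is sorted by timestamp in descending order and the most recently seen non-null user_id is carried along, which is equivalent to taking the first non-null value among the following rows. Note that groupBy still materializes each group on one executor, so by itself this does not fix the skew problem described in the question; for very large groups a secondary-sort approach (e.g., repartitionAndSortWithinPartitions) would be needed.

import org.apache.spark.rdd.RDD

// Illustrative row type mirroring the input columns
case class Record(primaryKey: Long, ipAddress: String, unixtimestamp: Long, userId: Option[String])

// For each ip_address, scan rows in descending timestamp order and carry the
// last non-null user_id seen, i.e. the next non-null value in ascending order.
def backfill(rows: RDD[Record]): RDD[Record] =
  rows
    .groupBy(_.ipAddress)
    .flatMap { case (_, group) =>
      var lastSeen: Option[String] = None
      group.toSeq
        .sortBy(-_.unixtimestamp)
        .map { r =>
          if (r.userId.isDefined) lastSeen = r.userId
          r.copy(userId = r.userId.orElse(lastSeen))
        }
    }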

[UPDATE]

For Spark 1.x, first(col, ignoreNulls) is not available in the DataFrame API. Here is a workaround that falls back to Spark SQL, which does support ignoreNulls:

// Might need to use registerTempTable() instead for Spark 1.x
df.createOrReplaceTempView("dfview")

// On Spark 1.x there is no SparkSession, so call sqlContext.sql(...) directly
val df2 = spark.sqlContext.sql("""
  select primary_key, ip_address, unixtimestamp,
  first(user_id, true) over (
    partition by ip_address order by unixtimestamp
    rows between current row and unbounded following
  ) as user_id from dfview
  order by primary_key
""")