I am new to Scala and I want to implement the following using Scala RDDs. Please help me.
Input:

primary_key ip_address unixtimestamp user_id
1 abc 1421140627 x
2 abc 1421140628
3 abc 1421140629 y
4 abc 1421140630 z
5 xyz 1421140233 k
6 xyz 1421140234
7 xyz 1421140235
8 xyz 1421140236 y
9 xyz 1421140237 n
10 noi 1421140112 f
12 noi 1421140113
13 noi 1421140114 g
14 noi 1421140115
15 noi 1421140116 h
16 noi 1421140117
17 noi 1421140118
Expected output:

primary_key ip_address unixtimestamp user_id
1 abc 1421140627 x
2 abc 1421140628 y
3 abc 1421140629 y
4 abc 1421140630 z
5 xyz 1421140233 k
6 xyz 1421140234 y
7 xyz 1421140235 y
8 xyz 1421140236 y
9 xyz 1421140237 n
10 noi 1421140112 f
12 noi 1421140113 g
13 noi 1421140114 g
14 noi 1421140115 h
15 noi 1421140116 h
16 noi 1421140117
17 noi 1421140118
Basically, for each ip_address group, I want to backfill user_id where it is null. I have implemented this successfully with Spark DataFrames for small data sizes, but when the number of rows per partition key (ip_address here) is very large (> 10 million), the job never completes. To give you an idea of the data size: the total row count is about 200 million, and the largest partition (the most rows for a single ip_address) has about 15 million rows.

Can someone help me achieve this with Scala RDDs? Thanks in advance.
As requested, please find my DataFrame solution below.
val partitionWindowWithUnboundedFollowing = Window.partitionBy(ipaddress)
  .orderBy(unixtimestamp)
  .rowsBetween(1, Long.MaxValue)

val input = hc.table("my_data")

val useridIdDerv = input.withColumn(USER_ID_FILLED,
  min(concat(trim(col(unix_timestamp)), lit("-"), trim(col(USER_ID))))
    .over(partitionWindowWithUnboundedFollowing))
After these two steps, I apply the substring function to USER_ID_FILLED and then do a SQL coalesce of user_id and USER_ID_FILLED (derived in the steps above).
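Roughly, those two steps look something like this (sketch only; substring_index stands in for the exact substring call I use, and the column-name vals are the same as above):

// Sketch only: strip the "<unixtimestamp>-" prefix, then fall back to the
// backfilled value where user_id is null.
val backfilled = useridIdDerv
  .withColumn(USER_ID_FILLED, substring_index(col(USER_ID_FILLED), "-", -1))
  .withColumn(USER_ID, coalesce(col(USER_ID), col(USER_ID_FILLED)))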
Answer (score: 0)
Not sure whether this will shorten the execution time significantly, but I think the backfilling of user_id can be simplified by using the first() function with ignoreNulls, as follows:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._  // needed for toDF (assumes a SparkSession named `spark`, e.g. in spark-shell)

val df = Seq(
  (1, "abc", 1421140627, "x"),
  (2, "abc", 1421140628, null),
  (3, "abc", 1421140629, "y"),
  (4, "abc", 1421140630, "z"),
  (5, "xyz", 1421140633, "k"),
  (6, "xyz", 1421140634, null),
  (7, "xyz", 1421140635, null),
  (8, "xyz", 1421140636, "y"),
  (9, "xyz", 1421140637, "n"),
  (10, "noi", 1421140112, "f"),
  (12, "noi", 1421140113, null),
  (13, "noi", 1421140114, "g"),
  (14, "noi", 1421140115, null),
  (15, "noi", 1421140116, "h"),
  (16, "noi", 1421140117, null),
  (17, "noi", 1421140118, null)
).toDF("primary_key", "ip_address", "unixtimestamp", "user_id")

// For each row, take the first non-null user_id at or after the current row
// within its ip_address group, ordered by unixtimestamp.
val windowSpec = Window.partitionBy("ip_address").orderBy("unixtimestamp").
  rowsBetween(0, Window.unboundedFollowing)

df.withColumn("user_id", first("user_id", ignoreNulls = true).over(windowSpec)).
  orderBy("primary_key").
  show
// +-----------+----------+-------------+-------+
// |primary_key|ip_address|unixtimestamp|user_id|
// +-----------+----------+-------------+-------+
// | 1| abc| 1421140627| x|
// | 2| abc| 1421140628| y|
// | 3| abc| 1421140629| y|
// | 4| abc| 1421140630| z|
// | 5| xyz| 1421140633| k|
// | 6| xyz| 1421140634| y|
// | 7| xyz| 1421140635| y|
// | 8| xyz| 1421140636| y|
// | 9| xyz| 1421140637| n|
// | 10| noi| 1421140112| f|
// | 12| noi| 1421140113| g|
// | 13| noi| 1421140114| g|
// | 14| noi| 1421140115| h|
// | 15| noi| 1421140116| h|
// | 16| noi| 1421140117| null|
// | 17| noi| 1421140118| null|
// +-----------+----------+-------------+-------+
[UPDATE]
For Spark 1.x, first(col, ignoreNulls) is not available in the DataFrame API. Here is a workaround that falls back to Spark SQL, which does support ignoreNulls:
// Might need to use registerTempTable() instead for Spark 1.x
df.createOrReplaceTempView("dfview")
val df2 = spark.sqlContext.sql("""
select primary_key, ip_address, unixtimestamp,
first(user_id, true) over (
partition by ip_address order by unixtimestamp
rows between current row and unbounded following
) as user_id from dfview
order by primary_key
""")