How to generate a sequence (daily incremental load) on a file with millions of records in Spark 2

Time: 2018-07-19 06:59:55

Tags: scala apache-spark

I have a business scenario that requires generating surrogate keys on a daily incremental table or file in Spark 2.0 with Scala 2.11.8. I know about zipWithIndex, row_number and monotonically_increasing_id(), but none of them fits a daily incremental load, because today's sequence has to continue from 1 + yesterday's last value. Accumulators do not work either, since they are write-only.

Example scenario: as of yesterday's load, my last customer_sk was 1001; in today's load I want customer_sk to start at 1002 and continue to the end of the file.

Note: I will have millions of rows, and the program will run in parallel across multiple nodes.

Thanks in advance.

2 Answers:

Answer 0: (score: 0)

1) Get the maximum customer_sk from the table.

2) Then add this maximum customer_sk to row_number so that the sequence continues from there.

If you are using an RDD instead, add the previous maximum to (zipWithIndex + 1); see the sketch below.
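A minimal sketch of that approach. The table name dim_customer, the key column customer_sk (assumed to be a bigint), the incremental DataFrame incrementalDf, and the ordering column load_ts are all illustrative names, not from the question:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, lit, max, row_number}

// 1) Previous maximum surrogate key from the existing table (falls back to 0 on the very first load).
val prevMax = spark.table("dim_customer")
  .agg(coalesce(max("customer_sk"), lit(0L)))
  .first()
  .getLong(0)

// 2) row_number() starts at 1, so adding prevMax continues yesterday's sequence.
//    Note: a window without partitionBy pulls all rows into one partition,
//    which can be a bottleneck for millions of rows.
val w = Window.orderBy("load_ts")
val withKeys = incrementalDf.withColumn("customer_sk", row_number().over(w) + lit(prevMax))

// RDD alternative: zipWithIndex starts at 0, so shift each index by prevMax + 1.
val withKeysRdd = incrementalDf.rdd.zipWithIndex.map { case (row, idx) => (row, idx + prevMax + 1) }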

Answer 1: (score: 0)

For everyone still looking for an answer with sample code:

hdfs dfs -cat /user/shahabhi/test_file_2.csv

abhishek,shah,123,pune,2018-12-31,2018-11-30
abhishek,shah,123,pune,2018-12-31,2018-11-30
ravi,sharma,464,mumbai,20181231,20181130
Mitesh,shah,987,satara,2018-12-31,2018-11-30
shalabh,nagar,981,satara,2018-12-31,2018-11-30
Gaurav,mehta,235,ujjain,2018/12/31,2018/11/30
Gaurav,mehta,235,ujjain,2018-12-31,2018-11-30
vikas,khanna,123,ujjain,2018-12-31,2018-11-30
vinayak,kale,789,pune,2018-12-31,2018-11-30

Spark code -

import org.apache.spark.sql.functions.monotonically_increasing_id

val df = spark.read.csv("/user/shahabhi/test_file_2.csv").toDF("name","lname","d_code","city","esd","eed")

df.show()

+--------+------+------+------+----------+----------+
|    name| lname|d_code|  city|       esd|       eed|
+--------+------+------+------+----------+----------+
|abhishek|  shah|   123|  pune|2018-12-31|2018-11-30|
|abhishek|  shah|   123|  pune|2018-12-31|2018-11-30|
|    ravi|sharma|   464|mumbai|  20181231|  20181130|
|  Mitesh|  shah|   987|satara|2018-12-31|2018-11-30|
| shalabh| nagar|   981|satara|2018-12-31|2018-11-30|
|  Gaurav| mehta|   235|ujjain|2018/12/31|2018/11/30|
|  Gaurav| mehta|   235|ujjain|2018-12-31|2018-11-30|
|   vikas|khanna|   123|ujjain|2018-12-31|2018-11-30|
| vinayak|  kale|   789|  pune|2018-12-31|2018-11-30|
+--------+------+------+------+----------+----------+

val df_2=df.withColumn("surrogate_key", monotonically_increasing_id())

df_2.show()

+--------+------+------+------+----------+----------+-------------+
|    name| lname|d_code|  city|       esd|       eed|surrogate_key|
+--------+------+------+------+----------+----------+-------------+
|abhishek|  shah|   123|  pune|2018-12-31|2018-11-30|            0|
|abhishek|  shah|   123|  pune|2018-12-31|2018-11-30|            1|
|    ravi|sharma|   464|mumbai|  20181231|  20181130|            2|
|  Mitesh|  shah|   987|satara|2018-12-31|2018-11-30|            3|
| shalabh| nagar|   981|satara|2018-12-31|2018-11-30|            4|
|  Gaurav| mehta|   235|ujjain|2018/12/31|2018/11/30|            5|
|  Gaurav| mehta|   235|ujjain|2018-12-31|2018-11-30|            6|
|   vikas|khanna|   123|ujjain|2018-12-31|2018-11-30|            7|
| vinayak|  kale|   789|  pune|2018-12-31|2018-11-30|            8|
+--------+------+------+------+----------+----------+-------------+

val df_3=df.withColumn("surrogate_key", monotonically_increasing_id()+1000)
df_3.show()

+--------+------+------+------+----------+----------+-------------+
|    name| lname|d_code|  city|       esd|       eed|surrogate_key|
+--------+------+------+------+----------+----------+-------------+
|abhishek|  shah|   123|  pune|2018-12-31|2018-11-30|         1000|
|abhishek|  shah|   123|  pune|2018-12-31|2018-11-30|         1001|
|    ravi|sharma|   464|mumbai|  20181231|  20181130|         1002|
|  Mitesh|  shah|   987|satara|2018-12-31|2018-11-30|         1003|
| shalabh| nagar|   981|satara|2018-12-31|2018-11-30|         1004|
|  Gaurav| mehta|   235|ujjain|2018/12/31|2018/11/30|         1005|
|  Gaurav| mehta|   235|ujjain|2018-12-31|2018-11-30|         1006|
|   vikas|khanna|   123|ujjain|2018-12-31|2018-11-30|         1007|
| vinayak|  kale|   789|  pune|2018-12-31|2018-11-30|         1008|
+--------+------+------+------+----------+----------+-------------+
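One caveat on the example above: monotonically_increasing_id() is guaranteed to be increasing and unique but not consecutive; the contiguous 0..8 values appear here only because the small file lands in a single partition, and the +1000 offset is hard-coded. A sketch of how the offset could instead be read from the existing target (the table name dim_customer is purely illustrative) and dense keys assigned via zipWithIndex:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{coalesce, lit, max}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Previous maximum surrogate key from the target (0 when the target is empty).
val offset = spark.table("dim_customer")
  .agg(coalesce(max("surrogate_key"), lit(0L)))
  .first()
  .getLong(0)

// zipWithIndex produces gap-free indexes starting at 0; shift them by offset + 1
// and append the key to each row.
val rowsWithKeys = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ (idx + offset + 1))
}
val schemaWithKey = StructType(df.schema.fields :+ StructField("surrogate_key", LongType, nullable = false))
val df_4 = spark.createDataFrame(rowsWithKeys, schemaWithKey)
df_4.show()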