I have a Spark dataframe with two columns ("time_stamp" and "message").
Sample dataframe:
Time_stamp              Message
2020-12-01 05:28:34:215 some text1 ID: 1
2020-12-01 05:28:40:210 some text2 error: A
2020-12-01 05:28:40:220 some text3 error: B
2020-12-01 05:28:41:203 some text4 error: A
2020-12-01 05:30:43:201 some text5 ID: 1
2020-12-01 05:32:50:215 some text6 ID: 2
2020-12-01 05:32:50:220 some text7 error: A
2020-12-01 05:48:51:220 some text8 error: C
2020-12-01 05:48:52:203 some text9 error: B
2020-12-01 05:51:53:201 some text10 ID: 2

I want to create another dataframe with the ID and the distinct errors that occur between the two rows containing the same ID.

Expected output:

ID Error
1  A
1  B
2  A
2  C
2  B

I tried the following code. However, it uses window functions that are not supported by Azure Databricks, and the code takes a long time to execute.
Can anyone provide a solution in SQL? Azure Databricks supports PySpark SQL well.
Thanks
Answer 0 (score: 1)
Not much to explain, except that I think PySpark would look better than Spark SQL here...
df.createOrReplaceTempView('df')

result = spark.sql("""
    select ID, error
    from (
        select *, row_number() over (partition by ID, error order by Time_stamp) rn
        from (
            select ID, Message[0] error, Message[1] Time_stamp
            from (
                select ID, explode(Message) Message
                from (
                    select ID, collect_set(array(Message, Time_stamp)) Message
                    from (
                        select Time_stamp, regexp_extract(Message, 'error: (.*)', 1) Message, ID
                        from (
                            select Time_stamp, Message,
                                   last(case when ID != '' then ID end, true) over (order by Time_stamp) ID
                            from (
                                select to_timestamp(Time_stamp, 'yyyy-MM-dd HH:mm:ss:SSS') Time_stamp,
                                       Message,
                                       regexp_extract(Message, 'ID: ([a-zA-Z0-9]+)', 1) ID
                                from df
                            )
                        ) where Message rlike 'error'
                    ) group by ID
                )
            )
        )
    ) where rn = 1 order by Time_stamp""")
result.show()
+---+-----+
| ID|error|
+---+-----+
| 1| A|
| 1| B|
| 2| A|
| 2| C|
| 2| B|
+---+-----+
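For readers who want to trace the logic without a Spark session, the same pipeline (forward-fill the last seen ID, extract the error code, keep only the first occurrence of each ID/error pair) can be sketched in plain Python. The sample rows come from the question; the helper name `id_error_pairs` is illustrative, not part of the answer:

```python
import re

# Sample (Time_stamp, Message) rows from the question.
rows = [
    ("2020-12-01 05:28:34:215", "some text1 ID: 1"),
    ("2020-12-01 05:28:40:210", "some text2 error: A"),
    ("2020-12-01 05:28:40:220", "some text3 error: B"),
    ("2020-12-01 05:28:41:203", "some text4 error: A"),
    ("2020-12-01 05:30:43:201", "some text5 ID: 1"),
    ("2020-12-01 05:32:50:215", "some text6 ID: 2"),
    ("2020-12-01 05:32:50:220", "some text7 error: A"),
    ("2020-12-01 05:48:51:220", "some text8 error: C"),
    ("2020-12-01 05:48:52:203", "some text9 error: B"),
    ("2020-12-01 05:51:53:201", "some text10 ID: 2"),
]

def id_error_pairs(rows):
    """Forward-fill the last seen ID, extract error codes, drop duplicates."""
    current_id = None
    seen = set()
    result = []
    # Fixed-width timestamps sort lexicographically, so plain sorted()
    # matches "order by Time_stamp".
    for _, message in sorted(rows):
        id_match = re.search(r"ID: ([a-zA-Z0-9]+)", message)
        if id_match:
            # Mirrors last(case when ID != '' then ID end, true)
            # over (order by Time_stamp).
            current_id = id_match.group(1)
            continue
        err_match = re.search(r"error: (.*)", message)
        if err_match and current_id is not None:
            pair = (current_id, err_match.group(1))
            if pair not in seen:  # mirrors row_number() ... where rn = 1
                seen.add(pair)
                result.append(pair)
    return result

print(id_error_pairs(rows))
# → [('1', 'A'), ('1', 'B'), ('2', 'A'), ('2', 'C'), ('2', 'B')]
```

This reads the rows once in timestamp order, whereas the SQL version has to group, explode, and re-window; the set-based de-duplication plays the role of `collect_set` plus `row_number() ... where rn = 1`.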