I have a set of data similar to the following, and I'm trying to find a way to reduce it using a Spark DataFrame in Python.
uuid  if_id  start_time           end_time             ip_addr
1     03     2018/07/01 13:00:00  2018/07/01 13:00:01  1.1.1.1
1     03     2018/07/01 13:01:05  2018/07/01 13:02:00  1.1.1.1
1     03     2018/07/01 15:00:00  2018/07/01 15:00:30  1.1.1.1
1     03     2018/07/02 01:00:00  2018/07/02 01:00:07  1.2.3.4
1     03     2018/07/02 08:30:00  2018/07/02 08:32:04  1.2.3.4
1     03     2018/07/02 12:00:00  2018/07/02 12:01:00  1.1.1.1
1     05     2018/07/01 15:00:02  2018/07/01 15:00:35  2.2.2.2
1     05     2018/07/01 13:45:23  2018/07/01 13:45:40  2.2.2.2
I need to reduce the above data to the following:
uuid  if_id  start_time           end_time             ip_addr
1     03     2018/07/01 13:00:00  2018/07/01 15:00:30  1.1.1.1
1     03     2018/07/02 01:00:00  2018/07/02 08:32:04  1.2.3.4
1     03     2018/07/02 12:00:00  2018/07/02 12:01:00  1.1.1.1
1     05     2018/07/01 13:45:23  2018/07/01 15:00:35  2.2.2.2
The final dataset should represent a table showing which IP address was assigned, during a given time period (from start_time to end_time), to a particular interface (if_id) of the host identified by uuid.
If it weren't possible for a given interface to change IP address over time, this could be handled with a groupBy and a window specification to extract the minimum start_time and maximum end_time. However, given that the address can change (as it does for uuid = 1 and if_id = 03), I'm not sure how to handle this without making multiple passes over the data; a sketch of that simple approach, and why it falls short, is below.
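For reference, here is roughly what the simple case would look like (a minimal sketch, assuming a DataFrame df with the columns shown above); grouping by the IP as well almost works, but it would wrongly merge the two separate 1.1.1.1 periods for if_id = 03 into one row:

import pyspark.sql.functions as F

# Collapse each (uuid, if_id, ip_addr) combination into a single interval.
# This ignores row ordering, so non-contiguous periods that share an IP
# (e.g. the two 1.1.1.1 runs for if_id = 03) get merged incorrectly.
collapsed = df.groupBy('uuid', 'if_id', 'ip_addr') \
              .agg(F.min('start_time').alias('start_time'),
                   F.max('end_time').alias('end_time'))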
Any suggested approaches would be appreciated.
Answer 0 (score: 0)
Using the link suggested by user8371915, I was able to come up with the following solution.
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import col, lit
import pyspark.sql.functions as func
from pyspark.sql.window import Window

# Create a session if one is not already available
spark = SparkSession.builder.getOrCreate()

# Sample data; event times are encoded as integers for brevity
df = spark.createDataFrame([
    Row(uuid=1, int_id='03', event_start=701130000, event_end=701130001, ip='1.1.1.1'),
    Row(uuid=1, int_id='03', event_start=701130105, event_end=701130200, ip='1.1.1.1'),
    Row(uuid=1, int_id='05', event_start=701134523, event_end=701134540, ip='2.2.2.2'),
    Row(uuid=1, int_id='03', event_start=701150000, event_end=701150030, ip='1.1.1.1'),
    Row(uuid=1, int_id='05', event_start=701150002, event_end=701150035, ip='2.2.2.2'),
    Row(uuid=1, int_id='03', event_start=702010000, event_end=702010007, ip='1.2.3.4'),
    Row(uuid=1, int_id='03', event_start=702083000, event_end=702083204, ip='1.2.3.4'),
    Row(uuid=1, int_id='03', event_start=702120000, event_end=702120100, ip='1.1.1.1')])
window1 = Window.partitionBy('uuid', 'int_id').orderBy('event_start', 'event_end')
window2 = Window.partitionBy('uuid', 'int_id', 'time_group') \
    .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
# Get the previous row's IP address within each (uuid, int_id) partition
prev_ip = func.lag('ip', 1).over(window1)

# Flag rows where the IP address differs from the previous row's; the first
# row in each partition (where lag() is null) also starts a new group
indicator = func.coalesce((col('ip') != prev_ip).cast('integer'), lit(1))

# Cumulative sum of the indicators over the window assigns a distinct
# time_group id to each contiguous run of rows sharing the same IP
time_group = func.sum(indicator).over(window1).alias('time_group')

# Add the time_group expression to the table
df = df.select('*', time_group)
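# At this point, the rows for uuid=1 / int_id='03' (ordered by event_start)
# carry time_group values 1, 1, 1, 2, 2, 3: each contiguous run of rows with
# the same IP gets its own group, which is what keeps the two separate
# 1.1.1.1 periods from being merged together.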
# Compute the begin and end times of each (interface, IP) period
df = df.select('uuid', 'int_id', 'ip',
               func.min('event_start').over(window2).alias('period_begin'),
               func.max('event_end').over(window2).alias('period_end')) \
       .dropDuplicates() \
       .orderBy('uuid', 'int_id', 'period_begin', 'ip')
df.show(truncate=False)
The above produces the following result:
+----+------+-------+------------+----------+
|uuid|int_id|ip |period_begin|period_end|
+----+------+-------+------------+----------+
|1 |03 |1.1.1.1|701130000 |701150030 |
|1 |03 |1.2.3.4|702010000 |702083204 |
|1 |03 |1.1.1.1|702120000 |702120100 |
|1 |05 |2.2.2.2|701134523 |701150035 |
+----+------+-------+------------+----------+
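The same pattern should carry over to the original schema; a minimal sketch (assuming the start_time/end_time strings use the yyyy/MM/dd HH:mm:ss format shown in the question):

from pyspark.sql.functions import to_timestamp

# Parse the string columns into proper timestamps so that the window
# ordering and the min/max aggregations compare real time values
ts_fmt = 'yyyy/MM/dd HH:mm:ss'
df = df.withColumn('start_time', to_timestamp('start_time', ts_fmt)) \
       .withColumn('end_time', to_timestamp('end_time', ts_fmt))

# ...then define window1/window2 over ('uuid', 'if_id') and proceed exactly
# as above, ordering by 'start_time' and 'end_time'.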