Dataflow: find the previous event in an event stream

Asked: 2019-03-21 11:57:31

Tags: python google-cloud-platform google-cloud-dataflow apache-beam

In short, I'm looking for something in Google Dataflow (Apache Beam) that works like LAG in Azure Stream Analytics.

Using a window of X minutes, I'm receiving data:

||||||  ||||||  ||||||  ||||||  ||||||  ||||||
|  1 |  |  2 |  |  3 |  |  4 |  |  5 |  |  6 | 
|id=x|  |id=x|  |id=x|  |id=x|  |id=x|  |id=x| 
|||||| ,|||||| ,|||||| ,|||||| ,|||||| ,|||||| , ...

I need to compare data(n) with data(n-1); following the example above, it would be something like:

if data(6) inside and data(5)  outside then ... 
if data(5) inside and data(4)  outside then ... 
if data(4) inside and data(3)  outside then ... 
if data(3) inside and data(2)  outside then ... 
if data(2) inside and data(1)  outside then ... 

Is there any "practical" way to do this?

1 answer:

Answer 0 (score: 1)

With Beam, as explained in the docs, state is maintained per key and window. Therefore, you cannot access values from previous windows.

To do what you want, you might need a more complex pipeline design. My idea, developed as an example here, is to duplicate your messages in a ParDo:

  • Send them, unmodified, to the main output
  • At the same time, emit them to a side output with a one-window lag

To achieve the second bullet point, we can add the duration of one window (WINDOW_SECONDS) to the element timestamp:

import apache_beam as beam

# WINDOW_SECONDS is the fixed-window length in seconds (10 in this example).
class DuplicateWithLagDoFn(beam.DoFn):

  def process(self, element, timestamp=beam.DoFn.TimestampParam):
    # Main output gets the unmodified element
    yield element
    # The same element goes to the side output with a 1-window lag added to its timestamp
    yield beam.pvalue.TaggedOutput(
        'lag_output',
        beam.window.TimestampedValue(element, timestamp + WINDOW_SECONDS))

We invoke the function specifying the corresponding output tags:

beam.ParDo(DuplicateWithLagDoFn()).with_outputs('lag_output', main='main_output')
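
For context, here is a minimal sketch of how this could be wired up so that the `results` object used below exists. The `messages` PCollection and the key-extraction step are assumptions (not part of the original answer); keys are needed for the CoGroupByKey further down:

WINDOW_SECONDS = 10  # assumed fixed-window length in seconds

# 'messages' is assumed to be a PCollection of "key,value" strings, e.g. read from Pub/Sub.
results = (
    messages
    | 'Add keys' >> beam.Map(lambda msg: (msg.split(',')[0], msg))
    | 'Duplicate' >> beam.ParDo(DuplicateWithLagDoFn()).with_outputs(
        'lag_output', main='main_output'))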

Then we apply the same windowing scheme to both outputs, co-group by key, and so on:

windowed_main = results.main_output | 'Window main output' >> beam.WindowInto(beam.window.FixedWindows(WINDOW_SECONDS))
windowed_lag = results.lag_output | 'Window lag output' >> beam.WindowInto(beam.window.FixedWindows(WINDOW_SECONDS))

merged = (windowed_main, windowed_lag) | 'Join Pcollections' >> beam.CoGroupByKey()
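
For reference, each element of merged then pairs the current window's values with the lagged copies of the previous window; this is the shape the next DoFn unpacks (the example value is taken from the job output shown further down):

# (key, ([values from the main output, i.e. current window],
#        [values from the lag output, i.e. previous window]))
# e.g. (u'test', ([u'test,40'], [u'test,120']))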

Finally, we can have both values (old and new) in the same ParDo and compare them:

import logging

class CompareDoFn(beam.DoFn):

  def process(self, element):
    logging.info("Combined with previous value: {}".format(element))

    # element is (key, ([current window values], [previous window values]))
    try:
      old_value = int(element[1][1][0].split(',')[1])
    except (IndexError, ValueError):
      old_value = 0

    try:
      new_value = int(element[1][0][0].split(',')[1])
    except (IndexError, ValueError):
      new_value = 0

    logging.info("New value: {}, Old value: {}, Difference: {}".format(new_value, old_value, new_value - old_value))
    yield (element[0], new_value - old_value)
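
A minimal usage sketch for attaching this DoFn to the merged PCollection (the step name and the diffs variable are just for illustration):

diffs = merged | 'Compare with previous' >> beam.ParDo(CompareDoFn())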

To test this, I ran the pipeline with the DirectRunner and, in a separate shell, published two messages a little more than 10 seconds apart (WINDOW_SECONDS was 10s in my case):

gcloud pubsub topics publish lag --message="test,120"
sleep 12
gcloud pubsub topics publish lag --message="test,40"

The job output shows the expected differences:

INFO:root:New message: (u'test', u'test,120')
INFO:root:Combined with previous value: (u'test', ([u'test,120'], []))
INFO:root:New value: 120, Old value: 0, Difference: 120
INFO:root:New message: (u'test', u'test,40')
INFO:root:Combined with previous value: (u'test', ([u'test,40'], [u'test,120']))
INFO:root:New value: 40, Old value: 120, Difference: -80
INFO:root:Combined with previous value: (u'test', ([], [u'test,40']))
INFO:root:New value: 0, Old value: 40, Difference: -40

Full code of my example here. Take into account the performance implications of duplicating elements, but this approach makes sense if you need values to be available in two windows.