Find the string with the maximum number of tokens using Apache Beam's Python SDK

Date: 2019-04-10 07:35:03

Tags: python apache-beam

I have a PCollection of strings. I want to split each string on spaces, find the token list with the maximum size, and store that size in an int variable.

Consider this example input:

sentences = ['This is the first sentence',
             'Second sentence',
             'Yet another sentence']

with beam.Pipeline(options=PipelineOptions()) as p:
       pcoll = p | 'Create' >> beam.Create(sentences)

After splitting, the token lists and their sizes are:

['This', 'is', 'the', 'first', 'sentence'] -> 5
['Second', 'sentence'] -> 2
['Yet', 'another', 'sentence'] -> 3

I want to store the maximum size (5 here) in a variable.

How can I do this? I came across this blogpost, but it doesn't quite do what I need: the author prints out the resulting PCollection, whereas I want to use this value later in other stages of the pipeline.

1 Answer:

Answer 0: (score: 2)

You can do this with the Top.Of transform. Briefly, we split each sentence into tokens and pair each token list with its length. With Top.Of we keep only the first (i.e. largest) result, passing a lambda as the comparison criterion so that elements are ordered by token count:

sentences = ['This is the first sentence',
             'Second sentence',
             'Yet another sentence']

longest_sentence = (p
  | 'Read Sentences' >> beam.Create(sentences)
  | 'Split into Words' >> beam.Map(lambda x: x.split(' '))
  | 'Map Token Length' >> beam.Map(lambda x: (x, len(x)))  # (token_list, count)
  | 'Top Sentence' >> combine.Top.Of(1, lambda a, b: a[1] < b[1])  # keep the largest count
  | 'Save Variable' >> beam.ParDo(SaveMaxFn()))

where SaveMaxFn() is:

class SaveMaxFn(beam.DoFn):
  """Stores the max token count in a global variable and logs it"""
  def process(self, element):
    # element is the one-element list emitted by Top.Of: [(token_list, count)]
    global length
    length = element[0][1]
    logging.info("Longest sentence: %s token(s)", length)

    return element

and length is a global variable:

global length

Result:

INFO:root:Longest sentence: 5 token(s)
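
If you need the maximum in later stages of the same pipeline (as the question asks), rather than only in a global variable, one option is to feed the Top result back in as a side input. The following is a minimal sketch under that assumption; the step names and the max_tokens / later_stage variables are illustrative, not part of the original answer:

max_tokens = (p
  | 'Sentences For Side Input' >> beam.Create(sentences)
  | 'Split For Side Input' >> beam.Map(lambda x: x.split(' '))
  | 'Pair With Length' >> beam.Map(lambda x: (x, len(x)))
  | 'Top For Side Input' >> combine.Top.Of(1, lambda a, b: a[1] < b[1])
  # Top.Of emits a single one-element list, so pull the count out of it
  | 'Extract Count' >> beam.Map(lambda top: top[0][1]))

later_stage = (p
  | 'More Sentences' >> beam.Create(sentences)
  # AsSingleton exposes the one-element max_tokens PCollection as a side input value
  | 'Pair With Max' >> beam.Map(
        lambda sentence, max_len: (sentence, max_len),
        max_len=beam.pvalue.AsSingleton(max_tokens)))

Each element of later_stage is then a (sentence, 5) pair, so the maximum is available to downstream transforms without relying on a global.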

Full code:

import argparse, logging

import apache_beam as beam
import apache_beam.transforms.combiners as combine
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions


class SaveMaxFn(beam.DoFn):
  """Stores the max token count in a global variable and logs it"""
  def process(self, element):
    # element is the one-element list emitted by Top.Of: [(token_list, count)]
    global length
    length = element[0][1]
    logging.info("Longest sentence: %s token(s)", length)

    return element


def run(argv=None):
  parser = argparse.ArgumentParser()
  known_args, pipeline_args = parser.parse_known_args(argv)

  global length

  pipeline_options = PipelineOptions(pipeline_args)
  pipeline_options.view_as(SetupOptions).save_main_session = True
  p = beam.Pipeline(options=pipeline_options)

  sentences = ['This is the first sentence',
               'Second sentence',
               'Yet another sentence']

  longest_sentence = (p
    | 'Read Sentences' >> beam.Create(sentences)
    | 'Split into Words' >> beam.Map(lambda x: x.split(' '))
    | 'Map Token Length' >> beam.Map(lambda x: (x, len(x)))
    | 'Top Sentence' >> combine.Top.Of(1, lambda a, b: a[1] < b[1])
    | 'Save Variable' >> beam.ParDo(SaveMaxFn()))

  result = p.run()
  result.wait_until_finish()

if __name__ == '__main__':
  logging.getLogger().setLevel(logging.INFO)
  run()
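
If only the integer maximum is needed, and not the winning token list itself, a shorter variant (a sketch, not part of the original answer) is to combine the per-sentence token counts globally with Python's built-in max:

max_token_count = (p
  | 'Create For Max' >> beam.Create(sentences)
  # Count tokens per sentence, then reduce them to a single global maximum
  | 'Count Tokens' >> beam.Map(lambda x: len(x.split(' ')))
  | 'Global Max' >> beam.CombineGlobally(max))

For the example input this produces a PCollection with the single element 5, which can likewise be consumed downstream as a side input.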