Python Apache Beam: multiple outputs and processing

Time: 2018-08-29 10:48:51

Tags: python apache-beam

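I'm new to Apache Beam concepts and am trying to run a job on Google Dataflow with the following flow: process_flow

Essentially, it takes a single data source, filters it based on certain values in each dictionary, and creates a separate output for each filter condition.

I have written the following code: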

# List of values to filter by
x_list = [1, 2, 3]

with beam.Pipeline(options=PipelineOptions().from_dictionary(pipeline_params)) as p:
    # Read in newline JSON data - each line is a dictionary
    log_data = (
        p 
        | "Create " + input_file >> beam.io.textio.ReadFromText(input_file)
        | "Load " + input_file >> beam.FlatMap(lambda x: json.loads(x))
    )

    # For each value in x_list, filter log_data for dictionaries containing the value & write out to separate file
    for i in x_list:
        # Return dictionary if given key = value in filter
        filtered_log = log_data | "Filter_"+i >> beam.Filter(lambda x: x['key'] == i)
        # Do additional processing
        processed_log = process_pcoll(filtered_log, event)
        # Write final file
        output = (
            processed_log
            | 'Dump_json_'+filename >> beam.Map(json.dumps)
            | "Save_"+filename >> beam.io.WriteToText(output_fp+filename,num_shards=0,shard_name_template="")
        )

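Currently, it only processes the first value in the list. I know I probably have to use ParDo, but I'm not sure how to work it into my flow.

Any help is appreciated!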

1 answer:

Answer 0: (score: 1)

You can use TaggedOutput in Beam. Write a Beam function that tags each element in the PCollection.

Now you can write each tagged output to a separate file or table.

Hope that helps!

Source: [https://beam.apache.org/documentation/sdks/pydoc/2.0.0/_modules/apache_beam/pvalue.html]