Dataflow / Apache Beam - How can I access the current filename when passing in a file pattern?

Asked: 2018-11-21 02:42:15

Tags: python google-cloud-platform google-bigquery google-cloud-dataflow apache-beam

I have seen this question answered before on Stack Overflow (https://stackoverflow.com/questions/29983621/how-to-get-filename-when-using-file-pattern-match-in-google-cloud-dataflow), but not since Apache Beam added splittable DoFn functionality for Python. How would I access the filename of the file currently being processed when passing a file pattern to a GCS bucket?

I want to pass the filename into my transform function:

with beam.Pipeline(options=pipeline_options) as p:
    lines = p | ReadFromText('gs://url to file')

    data = (
        lines
        | 'Jsonify' >> beam.Map(jsonify)
        | 'Unnest' >> beam.FlatMap(unnest)
        | 'Write to BQ' >> beam.io.Write(beam.io.BigQuerySink(
            'project_id:dataset_id.table_name', schema=schema,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
    )

Ultimately, what I want to do is pass the filename into my transform function as I transform each row of the json (see this), and then use the filename to do a lookup against a different BQ table to get a value. I think once I figure out how to get the filename I will be able to work out the side input part, so I can do the lookup in the BQ table and get the unique value.
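
A minimal sketch of that side-input idea, assuming each element has already been paired with its filename and that lookup_pairs is a hypothetical PCollection of (filename, value) pairs built from the other BQ table (keyed_lines and lookup_pairs are illustrative names, not part of the pipeline above):

def enrich(element, lookup):
    filename, row = element  # element assumed to be a (filename, row) pair
    return {'row': row, 'value': lookup.get(filename)}

enriched = (
    keyed_lines  # hypothetical PCollection of (filename, row) pairs
    | 'Enrich with BQ value' >> beam.Map(enrich, lookup=beam.pvalue.AsDict(lookup_pairs))
)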

2 answers:

Answer 0: (score: 4)

I tried to implement a solution with the previously cited case. There, as well as in other approaches such as this one, they also get a list of file names but load the whole file into a single element, which might not scale well with large files. Therefore, I looked into adding the filename to each record instead.

As input I used two csv files:

$ gsutil cat gs://$BUCKET/countries1.csv
id,country
1,sweden
2,spain

$ gsutil cat gs://$BUCKET/countries2.csv
id,country
3,italy
4,france

Using GCSFileSystem.match we can access metadata_list to retrieve FileMetadata containing the file path and size in bytes. In my example:

[FileMetadata(gs://BUCKET_NAME/countries1.csv, 29),
 FileMetadata(gs://BUCKET_NAME/countries2.csv, 29)]

The code to do so is:

result = [m.metadata_list for m in gcs.match(['gs://{}/countries*'.format(BUCKET)])]

We will read each of the matching files into a different PCollection. Since we don't know the number of files a priori, we need to programmatically create a list of names for each PCollection (p0, p1, ..., pN-1) and ensure that we have unique labels for each step ('Read file 0', 'Read file 1', etc.):

variables = ['p{}'.format(i) for i in range(len(result))]
read_labels = ['Read file {}'.format(i) for i in range(len(result))]
add_filename_labels = ['Add filename {}'.format(i) for i in range(len(result))]

Then we proceed to read each different file into its corresponding PCollection with ReadFromText and call the AddFilenamesFn ParDo to associate each record with the filename.

for i in range(len(result)):   
  globals()[variables[i]] = p | read_labels[i] >> ReadFromText(result[i].path) | add_filename_labels[i] >> beam.ParDo(AddFilenamesFn(), result[i].path)

where AddFilenamesFn is:

class AddFilenamesFn(beam.DoFn):
    """ParDo to output a dict with filename and row"""
    def process(self, element, file_path):
        file_name = file_path.split("/")[-1]
        yield {'filename':file_name, 'row':element}

My first approach was to use a Map function directly, which results in simpler code. However, result[i].path is resolved only when the lambda runs, at the end of the loop, so every record was incorrectly mapped to the last file in the list:

globals()[variables[i]] = p | read_labels[i] >> ReadFromText(result[i].path) | add_filename_labels[i] >> beam.Map(lambda elem: (result[i].path, elem))
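
For reference, one standard Python workaround is to bind the path at lambda definition time with a default argument, which avoids the late-binding problem while keeping the simpler Map approach (not tested against the same setup):

globals()[variables[i]] = p | read_labels[i] >> ReadFromText(result[i].path) \
    | add_filename_labels[i] >> beam.Map(lambda elem, path=result[i].path: (path, elem))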

Finally, we flatten all the PCollections into one:

merged = [globals()[variables[i]] for i in range(len(result))] | 'Flatten PCollections' >> beam.Flatten()

and we verify the results by logging the elements:

INFO:root:{'filename': u'countries2.csv', 'row': u'id,country'}
INFO:root:{'filename': u'countries2.csv', 'row': u'3,italy'}
INFO:root:{'filename': u'countries2.csv', 'row': u'4,france'}
INFO:root:{'filename': u'countries1.csv', 'row': u'id,country'}
INFO:root:{'filename': u'countries1.csv', 'row': u'1,sweden'}
INFO:root:{'filename': u'countries1.csv', 'row': u'2,spain'}

I tested this with both DirectRunner and DataflowRunner for Python SDK 2.8.0.

I hope this addresses the main issue here so that you can now continue integrating BigQuery into your full use case. You might need to use the Python Client Library for that; I wrote a similar Java example.
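
As a rough illustration only (the table name and column are placeholders, and a per-element query is usually less efficient than a side input), a lookup DoFn using the google-cloud-bigquery client could look roughly like this:

from google.cloud import bigquery
import apache_beam as beam

class LookupValueFn(beam.DoFn):
    """Hypothetical DoFn: enrich each {'filename', 'row'} dict with a value
    looked up in a placeholder BigQuery table keyed by filename."""
    def start_bundle(self):
        self.client = bigquery.Client()

    def process(self, element):
        query = ("SELECT value FROM `project_id.dataset_id.lookup_table` "
                 "WHERE filename = @filename")
        job_config = bigquery.QueryJobConfig(query_parameters=[
            bigquery.ScalarQueryParameter('filename', 'STRING', element['filename'])])
        rows = list(self.client.query(query, job_config=job_config).result())
        element['value'] = rows[0].value if rows else None
        yield element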

Full code:

import argparse, logging
from operator import add
from functools import reduce  # reduce is a builtin in Python 2; the explicit import keeps this working on Python 3

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io import ReadFromText
from apache_beam.io.filesystem import FileMetadata
from apache_beam.io.filesystem import FileSystem
from apache_beam.io.gcp.gcsfilesystem import GCSFileSystem

class GCSFileReader:
  """Helper class to read gcs files"""
  def __init__(self, gcs):
      self.gcs = gcs

class AddFilenamesFn(beam.DoFn):
    """ParDo to output a dict with filename and row"""
    def process(self, element, file_path):
        file_name = file_path.split("/")[-1]
        # yield (file_name, element) # use this to return a tuple instead
        yield {'filename':file_name, 'row':element}

# just logging output to visualize results
def write_res(element):
  logging.info(element)
  return element

def run(argv=None):
  parser = argparse.ArgumentParser()
  known_args, pipeline_args = parser.parse_known_args(argv)

  p = beam.Pipeline(options=PipelineOptions(pipeline_args))
  gcs = GCSFileSystem(PipelineOptions(pipeline_args))
  gcs_reader = GCSFileReader(gcs)

  # in my case I am looking for files that start with 'countries'
  BUCKET='BUCKET_NAME'
  result = [m.metadata_list for m in gcs.match(['gs://{}/countries*'.format(BUCKET)])]
  result = reduce(add, result)

  # create each input PCollection name and unique step labels
  variables = ['p{}'.format(i) for i in range(len(result))]
  read_labels = ['Read file {}'.format(i) for i in range(len(result))]
  add_filename_labels = ['Add filename {}'.format(i) for i in range(len(result))]

  # load each input file into a separate PCollection and add filename to each row
  for i in range(len(result)):
    # globals()[variables[i]] = p | read_labels[i] >> ReadFromText(result[i].path) | add_filename_labels[i] >> beam.Map(lambda elem: (result[i].path, elem))
    globals()[variables[i]] = p | read_labels[i] >> ReadFromText(result[i].path) | add_filename_labels[i] >> beam.ParDo(AddFilenamesFn(), result[i].path)

  # flatten all PCollections into a single one
  merged = [globals()[variables[i]] for i in range(len(result))] | 'Flatten PCollections' >> beam.Flatten() | 'Write results' >> beam.Map(write_res)

  p.run()

if __name__ == '__main__':
  run()

Answer 1: (score: 2)

I had to read some metadata files and use the filename for further processing. I was struggling with this until I finally came across apache_beam.io.ReadFromTextWithFilename.

def run(argv=None, save_main_session=True):
    import argparse
    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.io import ReadFromTextWithFilename

    class ExtractMetaData(beam.DoFn):
        def process(self, element):
            # ReadFromTextWithFilename emits (filename, line) tuples
            filename, meta = element
            image_name = filename.split("/")[-2]
            labels = json.loads(meta)["labels"]
            image = {"image_name": image_name, "labels": labels}
            print(image)
            yield image  # yield the dict as a single element (returning it would emit only its keys)

    parser = argparse.ArgumentParser()
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)

    with beam.Pipeline(options=pipeline_options) as pipeline:
        meta = (
            pipeline
            | "Read Metadata" >> ReadFromTextWithFilename(f'gs://{BUCKET}/dev-set/**/*metadata.json')  # BUCKET is a placeholder
            | beam.ParDo(ExtractMetaData())
        )
    # the with-block runs the pipeline on exit, so no separate pipeline.run() call is needed
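
For reference, a minimal sketch (the path is a placeholder) showing the (filename, line) tuples that ReadFromTextWithFilename emits, which is why the DoFn above unpacks the element first:

import apache_beam as beam
from apache_beam.io import ReadFromTextWithFilename

with beam.Pipeline() as p:
    (
        p
        | 'Read with filename' >> ReadFromTextWithFilename('gs://BUCKET_NAME/dev-set/**/*metadata.json')
        | 'Print elements' >> beam.Map(print)  # each element is a (filename, line) tuple
    )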