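Files processed in a mapreduce job

Time: 2014-03-17 00:25:29

Tags: python google-app-engine mapreduce

I am trying to determine which files were processed by a mapreduce job on App Engine. I select the input with a wildcard on a Cloud Storage bucket: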

from mapreduce import base_handler
from mapreduce import mapreduce_pipeline


class FilesPipeline(base_handler.PipelineBase):

    def run(self):
        # Input files are selected with a wildcard pattern.
        output_blobstore_ids = yield mapreduce_pipeline.MapreducePipeline(
            "proc_files",
            "project.task.proc_files.mapper",
            "project.task.proc_files.reducer",
            "mapreduce.input_readers.FileInputReader",
            "mapreduce.output_writers.BlobstoreOutputWriter",
            mapper_params={
                'shard_count': 4,
                'batch_size': 50,
                'files': ['/gs/project_inbox/partner*'],
                'format': 'lines',
            },
            reducer_params={},
            shards=4)

        # StoreOutput is a pipeline defined elsewhere in the project.
        yield StoreOutput(output_blobstore_ids)

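Ideally, I would like to get the list of files the mapreduce pipeline processed inside the StoreOutput instance, but anywhere would do.

Thanks!

1 answer:

Answer 0: (score: 0)

Here is what I found: build the list of files yourself before starting the pipeline and pass it in as a parameter, so the list is known up front.

Preparing the pipeline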

files_list = self.get_files_list()
pipeline = FilesPipeline(mapper_files=files_list)
pipeline.start(queue_name='proc-files')
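
The answer does not show `get_files_list()`. As a minimal sketch, assuming you already have the candidate object paths in hand (for example from a bucket listing, which is not shown here), the wildcard from the question could be expanded into an explicit list with the standard `fnmatch` module. The names `expand_file_pattern` and `candidate_paths` are hypothetical and not part of the mapreduce library.

```python
import fnmatch


def expand_file_pattern(pattern, candidate_paths):
    """Return the sorted paths that match a shell-style wildcard pattern.

    pattern         -- e.g. '/gs/project_inbox/partner*'
    candidate_paths -- iterable of full '/gs/<bucket>/<object>' paths;
                       how these are obtained is left to the caller
                       (assumption: some bucket listing).

    Note: fnmatch's '*' also matches '/', so objects in nested
    "folders" under the prefix are included as well.
    """
    return sorted(
        path for path in candidate_paths
        if fnmatch.fnmatchcase(path, pattern)
    )
```

The resulting list can be passed as `mapper_files` and kept around (logged or stored) as the record of which files the job was asked to process.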

Pipeline definition

class FilesPipeline(base_handler.PipelineBase):

    def run(self, mapper_files=None):
        # Avoid a mutable default argument; fall back to an empty list.
        mapper_files = mapper_files or []

        output_blobstore_ids = yield mapreduce_pipeline.MapreducePipeline(
            "proc_files",
            "project.task.proc_files.mapper",
            "project.task.proc_files.reducer",
            "mapreduce.input_readers.FileInputReader",
            "mapreduce.output_writers.BlobstoreOutputWriter",
            mapper_params={
                'shard_count': 4,
                'batch_size': 50,
                # Explicit list of files instead of a wildcard pattern.
                'files': mapper_files,
                'format': 'lines',
            },
            reducer_params={},
            shards=4)

        yield StoreOutput(output_blobstore_ids)

Hope this helps someone.