Chaining MapReduces - Google AppEngine

Time: 2017-03-26 17:43:36

Tags: python google-app-engine mapreduce

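I am trying to chain pipelines so that the output of a reduce is sent to a map, similar to this person: I would like to chain multiple mapreduce jobs in google app engine in Python. I tried his solution, but it did not work. My pipeline flow is:
Map1
Reduce1
Map2
Reduce2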
I save the output of Reduce1 to the blobstore under a blob_key, then try to access the blob from Map2. But when the second map step executes, I get the following error:

"BadReaderParamsError: Could not find blobinfo for key <blob_key here>"

Here is the pipeline code. The BlobKey class it yields takes the intermediate output and generates the blob key for Map2:

class SongsPurchasedTogetherPipeline(base_handler.PipelineBase):

  def run(self, filekey, blobkey):
    bucket_name = app_identity.get_default_gcs_bucket_name()
    intermediate_output = yield mapreduce_pipeline.MapreducePipeline(
        "songs_purchased_together_intermediate",
        "main.songs_purchased_together_map1",
        "main.songs_purchased_together_reduce1",
        "mapreduce.input_readers.BlobstoreLineInputReader",
        "mapreduce.output_writers.GoogleCloudStorageOutputWriter",
        mapper_params={
            "blob_keys": blobkey,
        },
        reducer_params={
            "output_writer": {
                "bucket_name": bucket_name,
                "content_type": "text/plain",
            }
        },
        shards=1)
    yield StoreOutput("SongsPurchasedTogetherIntermediate", filekey, intermediate_output)

    intermediate_output_key = yield BlobKey(intermediate_output)
    output = yield mapreduce_pipeline.MapreducePipeline(
        "songs_purchased_together",
        "main.songs_purchased_together_map2",
        "main.songs_purchased_together_reduce2",
        "mapreduce.input_readers.BlobstoreLineInputReader",
        "mapreduce.output_writers.GoogleCloudStorageOutputWriter",
        mapper_params=(intermediate_output_key),
        reducer_params={
            "output_writer": {
                "bucket_name": bucket_name,
                "content_type": "text/plain",
            }
        },
        shards=1)
    yield StoreOutput("SongsPurchasedTogether", filekey, output)

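The BlobKey class used in the pipeline above takes the intermediate output and generates the blob key for Map2:

class BlobKey(base_handler.PipelineBase):

  def run(self, output):
    blobstore_filename = "/gs" + output[0]
    blobstore_gs_key = blobstore.create_gs_key(blobstore_filename)
    return {
        "blob_keys": blobstore_gs_key
    }

The StoreOutput class is the same as the one in Google's MapReduce demo https://github.com/GoogleCloudPlatform/appengine-mapreduce/blob/master/python/demo/main.py and does exactly the same thing as the BlobKey class, but additionally sends the blob's URL to the HTML page as a link.

Manually visiting the blob's URL by typing it into the browser (after Reduce1 succeeds but Map2 fails) displays the output expected from Reduce1. Why can't Map2 find the blob? Sorry, I am new to AppEngine, and I am probably going wrong somewhere because I don't fully understand blob storage.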

1 answer:

Answer 0 (score: 0)

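OK, I found out that Google has removed BlobstoreOutputWriter from the list of standard writers in the GAE GitHub repository, which made things a bit more complicated. I had to write to Google Cloud Storage and read from there. I wrote a helper class that generates the mapper parameters for GoogleCloudStorageInputReader.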

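from google.appengine.api import app_identity
from mapreduce import base_handler


class GCSMapperParams(base_handler.PipelineBase):
  """Builds mapper_params for GoogleCloudStorageInputReader."""

  def run(self, GCSPath):
    # GCSPath is the list of output paths produced by the previous stage.
    bucket_name = app_identity.get_default_gcs_bucket_name()
    return {
        "input_reader": {
            "bucket_name": bucket_name,
            # Drop the first two '/'-separated segments of each path.
            "objects": [path.split('/', 2)[2] for path in GCSPath],
        }
    }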

The helper takes as its argument the output of a MapReduce stage that uses GoogleCloudStorageOutputWriter, and returns a dictionary that can be assigned to the mapper_params of the next MapReduce stage.

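Basically, the output value of the first MapReduce stage is a list containing <app_name>/<pipeline_name>/key/output-[i], where i is the shard number. To use GoogleCloudStorageInputReader, the keys to the data should be passed through the objects entry in mapper_params. The keys must be of the form key/output-[i], so the helper class simply strips <app_name>/<pipeline_name>/ from them.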
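To make the path stripping concrete, here is a minimal sketch in plain Python that runs without App Engine. The names strip_leading_segments and build_mapper_params, the bucket name, and the sample paths are all hypothetical and only illustrate the split the helper performs; the real helper gets its bucket name from app_identity.

```python
def strip_leading_segments(path, count=2):
    """Drop the first `count` '/'-separated segments of `path`.

    With count=2 this is the same expression the helper uses:
    path.split('/', 2)[2].
    """
    return path.split('/', count)[count]


def build_mapper_params(bucket_name, gcs_paths):
    """Build a dict shaped like the one GCSMapperParams returns."""
    return {
        "input_reader": {
            "bucket_name": bucket_name,
            "objects": [strip_leading_segments(p) for p in gcs_paths],
        }
    }


# Hypothetical output list from a first stage that ran with two shards.
stage_one_output = [
    "my_app/my_pipeline/key/output-0",
    "my_app/my_pipeline/key/output-1",
]
params = build_mapper_params("my-bucket", stage_one_output)
print(params["input_reader"]["objects"])

# A leading slash makes the first segment an empty string, so the same
# split keeps one more named segment than in the case above.
print(strip_leading_segments("/my_app/my_pipeline/key/output-0"))
```

Because the result depends on whether the paths begin with a slash, it is worth printing the first stage's output once and checking that the stripped names match the object names in the bucket before passing them to the next stage.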