I want to chain multiple mapreduce jobs in Google App Engine in Python

Asked: 2014-06-09 16:04:57

Tags: python google-app-engine mapreduce

Caveat: I am new to Google App Engine and Python, but so far I have managed to implement the PageRank algorithm in Google App Engine.

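Next, I would like to chain three mapreduce jobs together in Google App Engine. However, I do not understand how to use BlobKeys to pass the key-value pairs from the first mapreduce job to the second (and then from the second to the third). I tried to follow what is described on the following site: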

http://mattfaus.com/2013/10/google-appengine-mapreduce-in-depth/

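It uses a BlobKeys class to pass the BlobKey from one mapreduce job to the next. I believe I am implementing the Python class incorrectly because, when it is called, the "third_party" object is not recognized in the code below.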

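Could someone point out where I am going wrong? Apologies for not being able to provide a test that runs locally. This seems to be a beast!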

Here is the class I am trying to use:

class BlobKeys(mapreduce.base_handler.PipelineBase):
  """Returns mapper params holding the bare blob keys taken from "/blobstore/<key>" paths."""

  def run(self, keys):
    # Remove the key from a string in this format:
    # /blobstore/<key>
    return {
        "blob_keys": [k.split("/")[-1] for k in keys]
    }
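
As a side check that does not need App Engine, the key-stripping expression inside run() can be tried on plain strings. This is only a minimal sketch: the helper name strip_blobstore_prefix and the sample keys are made up for illustration, and real blob keys are long opaque strings.

```python
def strip_blobstore_prefix(paths):
    """Return the bare key from each "/blobstore/<key>" path."""
    # Same expression as in BlobKeys.run(): keep only the last path segment.
    return [p.split("/")[-1] for p in paths]


# Hypothetical values standing in for what the output writer reports.
writer_output = ["/blobstore/key-one", "/blobstore/key-two"]

# Shape of the dictionary that the next job's input reader would receive.
mapper_params = {"blob_keys": strip_blobstore_prefix(writer_output)}
print(mapper_params)
```

A string without any slash passes through unchanged, so the helper is harmless if a bare key is supplied.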

This is the Pipeline code that calls the BlobKeys class above (it does not recognize the third_party object):

num_shards=2
# First define the parent pipeline job
class RecommenderPipeline(base_handler.PipelineBase):
  """A pipeline to run Recommender demo.

  Args:
    blobkey: blobkey to process as string. Should be a zip archive with
      text files inside.
  """

  def run(self, filekey, blobkey, itr):
    #logging.debug("filename is %s" % filekey)
    output1 = yield mapreduce_pipeline.MapreducePipeline(
        "recommender",
        "main.recommender_group_by_user_rating_map1",
        "main.recommender_count_ratings_user_freq_reduce1",
        "mapreduce.input_readers.BlobstoreLineInputReader",
        "mapreduce.output_writers.BlobstoreOutputWriter",
        mapper_params={
            "blob_keys": blobkey,
        },
        reducer_params={
            "mime_type": "text/plain",
        },
        shards=num_shards)


    # Code below takes output1 and feeds into second mapreduce job.
    # Pipeline library ensures that the second pipeline depends on first and 
    # does not launch until the first has resolved.
    output2 = (
    yield mapreduce_pipeline.MapreducePipeline(
        "recommender",
        "main.recommender_pairwise_items_map2",
        "main.recommender_calc_similarity_reduce2",
        "mapreduce.input_readers.BlobstoreLineInputReader",
        "mapreduce.output_writers.BlobstoreOutputWriter",
        mapper_params=(BlobKeys(output1)),  # see BlobKeys class!
        # "blob_keys": [k.split("/")[-1] for k in keys]
        #"blob_keys": blobkey, # did not work since "generator pipelines cannot
        # directly access outputs of the child Pipelines that it yields", this code
        # would require the generator pipeline to create a temporary dict object 
        # with the output of the first job - this is not allowed.
        # In addition, the string returned by BlobstoreOutputWriter is in the format
        # /blobstore/<key>, but BlobstoreLineInputReader expects only "<key>"
        # To solve these problems, use the BlobKeys class above.
        #},
        #mapper_params={
        #    #"blob_keys": [k.split("/")[-1] for k in output1]
        #    "blob_keys": blobkey.split("/")[-1],
        #},
        reducer_params={
            "mime_type": "text/plain",
        },
        shards=num_shards))

    # Code below takes output2 and feeds into third mapreduce job.
    # Pipeline library ensures that the third pipeline depends on second and 
    # does not launch until the second has resolved.
    output3 = (
    yield mapreduce_pipeline.MapreducePipeline(
        "recommender",
        "main.recommender_calculate_ranking_map3",
        "main.recommender_ranked_items_reduce3",
        "mapreduce.input_readers.BlobstoreLineInputReader",
        "mapreduce.output_writers.BlobstoreOutputWriter",
        mapper_params=(BlobKeys(output2)),  # see BlobKeys class!
        #mapper_params={
        #    "blob_keys": blobkey.split("/")[-1],
        #},
        reducer_params={
            "mime_type": "text/plain",
        },
        shards=num_shards))
    yield StoreOutput("Recommender", filekey, output3, itr)  #stores key to results so you can look at it.

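I am wondering whether my problem lies more with my use of Python classes or more with implementing this in GAE. I suspect it is a mixture of both. Any help would be greatly appreciated! Thanks!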

1 Answer:

Answer 0 (score: 1)

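Pipeline arguments can be either concrete values or PipelineFutures (in which case the pipeline waits until the future's value is available). In your case, you are passing a PipelineFuture as an argument to a concrete value (BlobKeys). Instead, try yielding BlobKeys(output1) and passing its result as the argument to the next pipeline. For example:

    output1_1 = yield BlobKeys(output1)
    output2 = yield mapreduce_pipeline.MapreducePipeline(..., mapper_params=output1_1, ...)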
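
To make the difference between a future and a concrete value more tangible, here is a toy model in plain Python. It is not the App Engine pipeline library: Future, Stage, and run_parent are hypothetical names invented for this sketch, and the real library is far more involved. The sketch only assumes the general idea described above, namely that the parent generator receives placeholders and that child stages see real values only once the runner has resolved them.

```python
class Future(object):
    """Placeholder for a child stage's result; filled in later by the runner."""
    def __init__(self):
        self.value = None


class Stage(object):
    """A child stage: a plain function plus one argument (value or Future)."""
    def __init__(self, func, arg):
        self.func = func
        self.arg = arg
        self.output = None


def run_parent(generator):
    """Toy runner: first collect the yielded stages, then run them in order."""
    stages = []
    try:
        stage = next(generator)
        while True:
            stage.output = Future()               # still unresolved here
            stages.append(stage)
            stage = generator.send(stage.output)  # parent only sees the Future
    except StopIteration:
        pass
    result = None
    for stage in stages:
        # Resolve a Future argument just before the stage that needs it runs.
        arg = stage.arg.value if isinstance(stage.arg, Future) else stage.arg
        stage.output.value = stage.func(arg)
        result = stage.output.value
    return result


def first_job(_):
    return ["/blobstore/key-one", "/blobstore/key-two"]  # made-up keys


def blob_keys(paths):
    return {"blob_keys": [p.split("/")[-1] for p in paths]}


def second_job(mapper_params):
    return "read %d blobs" % len(mapper_params["blob_keys"])


def parent():
    output1 = yield Stage(first_job, None)
    # output1 is an unresolved Future here, so blob_keys(output1) cannot be
    # called directly; it has to be yielded as its own stage.
    params = yield Stage(blob_keys, output1)
    yield Stage(second_job, params)


print(run_parent(parent()))
```

In this toy, calling blob_keys(output1) inside parent() would raise a TypeError because a Future is not iterable, which mirrors why the intermediate step has to be yielded as a stage of its own.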
