
Question

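I wrote a simple MapReduce flow that reads CSV lines from a file on Google Cloud Storage and then creates an entity for each line. However, I can't seem to get it to run on more than one shard.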

The code uses mapreduce.control.start_map and looks something like this.

class LoadEntitiesPipeline(webapp2.RequestHandler):
        id = control.start_map(map_name,
                          handler_spec="backend.line_processor",
                          reader_spec="mapreduce.input_readers.FileInputReader",
                          queue_name=get_queue_name("q-1"),
                          shard_count=shard_count,
                          mapper_parameters={
                              'shard_count': shard_count,
                              'batch_size': 50,
                              'processing_rate': 1000000,
                              'files': [gsfile],
                              'format': 'lines'})

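I have shard_count in both places because I'm not sure which method actually needs it. Setting shard_count anywhere from 8 to 32 changes nothing: the status page always says 1/1 shards running. To separate things, I've made everything run on a backend queue with lots of instances. I've also tried adjusting the queue parameters per this wiki. In the end, it still just runs serially.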

Any ideas? Thanks!

Update (still no success):

In trying to isolate things, I tried making the call through a pipeline directly, like so:

class ImportHandler(webapp2.RequestHandler):

    def get(self, gsfile):
        pipeline = LoadEntitiesPipeline2(gsfile)
        pipeline.start(queue_name=get_queue_name("q-1"))

        self.redirect(pipeline.base_path + "/status?root=" + pipeline.pipeline_id)


class LoadEntitiesPipeline2(base_handler.PipelineBase):

    def run(self, gsfile):
        yield mapreduce_pipeline.MapperPipeline(
           'loadentities2_' + gsfile,
           'backend.line_processor',
           'mapreduce.input_readers.FileInputReader',
           params={'files': [gsfile], 'format': 'lines'},
           shards=32
        )

With this new code, it still only runs on one shard. I'm starting to wonder whether mapreduce.input_readers.FileInputReader is capable of parallelizing input by line.

Answer 1

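It looks like FileInputReader can only shard by file. The format parameter only changes the way the mapper function gets called. If you pass more than one file to the mapper, it will start to run on more than one shard. Otherwise it will use only one shard to process the data.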

Edit #1:

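After digging deeper into the mapreduce library: MapReduce decides whether to split a file into pieces based on what the can_split method returns for each file format it defines. Currently, the only format that implements the split method is ZipFormat. So if your file format is not zip, the file will not be split to run on more than one shard.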

  @classmethod
  def can_split(cls):
    """Indicates whether this format support splitting within a file boundary.

    Returns:
      True if a FileFormat allows its inputs to be splitted into
    different shards.
    """

https://code.google.com/p/appengine-mapreduce/source/browse/trunk/python/src/mapreduce/file_formats.py

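But it looks like it is possible to write your own split method for a file format. As a first hack, you could add a split method to _TextFormat and see whether more than one shard starts running.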

@classmethod
def split(cls, desired_size, start_index, opened_file, cache):
    pass

Edit #2:

An easy workaround is to let the FileInputReader run serially but move the time-consuming work to the reduce stage, which runs in parallel.

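import random

from google.appengine.ext import db


def line_processor(line):
    # Map step: runs serially, so keep it cheap. The random key spreads
    # the lines across the reducers.
    yield (random.randrange(1000), line)


def reducer(key, values):
    # Reduce step: runs in parallel, so do the expensive work here.
    # CREATE_ENTITY_FROM_VALUE is a placeholder for your own entity builder.
    entities = [CREATE_ENTITY_FROM_VALUE(v) for v in values]
    db.put(entities)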
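To see why the random key helps, here is a minimal local sketch of the same idea with no App Engine dependency. The shuffle and run_locally helpers and the key_space parameter are made up for illustration and are not part of the mapreduce library; they only mimic how lines tagged with random keys end up grouped into independent buckets that reducers could process in parallel.

```python
import random
from collections import defaultdict


def line_processor(line, key_space=1000):
    # Serial map step: tag each line with a random key in [0, key_space).
    yield (random.randrange(key_space), line)


def shuffle(pairs):
    # Hypothetical stand-in for the shuffle stage: group values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def run_locally(lines, key_space=1000):
    # Run the map step over every line, then group the results by key.
    pairs = (pair for line in lines for pair in line_processor(line, key_space))
    return shuffle(pairs)
```

Each bucket returned by run_locally corresponds to one reducer call. A smaller key space gives fewer, larger batches per db.put; a larger one gives more, smaller batches.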

Edit #3:

If you want to try modifying the FileFormat, here is an example (not tested yet):

from mapreduce.file_formats import _TextFormat, FORMATS


class _LinesSplitFormat(_TextFormat):
  """Read file line by line."""

  NAME = 'split_lines'

  def get_next(self):
    """Inherited."""
    index = self.get_index()
    cache = self.get_cache()
    offset = sum(cache['infolist'][:index])

    self.get_current_file().seek(offset)
    result = self.get_current_file().readline()
    if not result:
      raise EOFError()
    if 'encoding' in self._kwargs:
      result = result.encode(self._kwargs['encoding'])
    return result

  @classmethod
  def can_split(cls):
    """Inherited."""
    return True

  @classmethod
  def split(cls, desired_size, start_index, opened_file, cache):
    """Inherited."""
    if 'infolist' in cache:
      infolist = cache['infolist']
    else:
      infolist = []
      for i in opened_file:
        infolist.append(len(i))
      cache['infolist'] = infolist

    index = start_index
    while desired_size > 0 and index < len(infolist):
      desired_size -= infolist[index]
      index += 1
    return desired_size, index


FORMATS['split_lines'] = _LinesSplitFormat

The new file format can then be used by changing 'format' in mapper_parameters from lines to split_lines.

class LoadEntitiesPipeline(webapp2.RequestHandler):
    id = control.start_map(map_name,
                      handler_spec="backend.line_processor",
                      reader_spec="mapreduce.input_readers.FileInputReader",
                      queue_name=get_queue_name("q-1"),
                      shard_count=shard_count,
                      mapper_parameters={
                          'shard_count': shard_count,
                          'batch_size': 50,
                          'processing_rate': 1000000,
                          'files': [gsfile],
                          'format': 'split_lines'})
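The arithmetic in the split method of Edit #3 can be checked locally before touching the library. The sketch below copies the loop onto a plain list of line lengths and assumes (this is a guess about the caller, not confirmed from the library source) that split is called repeatedly, each call starting at the index returned by the previous one. shard_ranges is a hypothetical helper, not a library function.

```python
def split(desired_size, start_index, line_lengths):
    # Same loop as _LinesSplitFormat.split, on a plain list of line lengths.
    index = start_index
    while desired_size > 0 and index < len(line_lengths):
        desired_size -= line_lengths[index]
        index += 1
    return desired_size, index


def shard_ranges(line_lengths, shard_count):
    # Hypothetical driver: return one (start, end) line-index range per shard.
    total = sum(line_lengths)
    # Ceiling division, with a floor of 1 so every call makes progress.
    desired = max(1, -(-total // shard_count))
    ranges = []
    start = 0
    while start < len(line_lengths):
        _, end = split(desired, start, line_lengths)
        ranges.append((start, end))
        start = end
    return ranges
```

For eight lines of 10 bytes each and shard_count=4, this yields four ranges of two lines each. With uneven line lengths a shard can overshoot its target size, because a line is never cut in half.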

Answer 2

It looks to me like FileInputReader should be capable of sharding, based on a quick read of: https://code.google.com/p/appengine-mapreduce/source/browse/trunk/python/src/mapreduce/input_readers.py

It looks like 'format': 'lines' should split using: self.get_current_file().readline()

Does it seem to interpret the lines correctly when it runs serially? Maybe the line breaks are in the wrong encoding or something.

Answer 3

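From experience, FileInputReader will do at most one shard per file. Solution: split your big files. I use split_file in https://github.com/johnwlockwood/karl_data to shard files before uploading them to Cloud Storage. If the big files are already up there, you can use a Compute Engine instance to pull them down and do the sharding, since that gives the fastest transfer speed. FYI: karld is in the cheeseshop, so you can pip install karld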
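If you would rather not add a dependency, a rough standard-library sketch of the same idea follows. It is not karld's split_file; split_file_by_lines and the chunk_%05d.csv naming are made up for illustration. It cuts a file into pieces of at most lines_per_chunk lines so that each piece can be uploaded and passed to 'files' as its own entry.

```python
import os


def split_file_by_lines(path, lines_per_chunk, out_dir):
    """Split a text file into chunk files of at most lines_per_chunk lines.

    Returns the chunk paths in order. Lines are copied as raw bytes,
    so line endings and encoding are left untouched.
    """
    chunk_paths = []
    out = None
    with open(path, "rb") as src:
        for line_number, line in enumerate(src):
            # Start a new chunk file every lines_per_chunk lines.
            if line_number % lines_per_chunk == 0:
                if out is not None:
                    out.close()
                chunk_path = os.path.join(
                    out_dir, "chunk_%05d.csv" % len(chunk_paths))
                chunk_paths.append(chunk_path)
                out = open(chunk_path, "wb")
            out.write(line)
    if out is not None:
        out.close()
    return chunk_paths
```

Note that a CSV header row, if present, ends up only in the first chunk, so the mapper has to handle that case.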

How do I get AppEngine MapReduce to scale out?
