MapReduce() takes longer than a for loop

Asked: 2014-07-24 13:17:39

Tags: google-app-engine mapreduce

The Story

I maintain a GAE webapp that exports data. Currently it dumps an ndb table to a CSV file with a for loop. That loop now takes too long (~25 minutes) and does not always finish before the job has to be moved to another machine. I tried to shorten the job with MapReduce(), but my MapReduce() job has run for hours without any meaningful output, errors, or log entries; nothing at all. The mapper never finishes and never even attempts to write to BQ. I'm sure I'm missing something. Any suggestions would help.

The Code (some variable names have been changed to protect their identity)

Original For Loop

# Imports this snippet relies on (cloudstorage is the GCS client library):
import logging
import StringIO

import cloudstorage as gcs
from google.appengine.api.datastore_errors import Timeout

more = True
cursor = None
formatted_rows = []
query = NDBTable.query()
while more:
    rows, cursor, more = query.fetch_page(page_size=5000, start_cursor=cursor)
    try:
        formatted_rows += [datastore_map(row) for row in rows]
    except Timeout, e:
        error_msg = "{}\nThe Datastore timed out while trying to map the row export for \n{}".format(repr(e), rows)
        logging.error(error_msg)  # must stay inside the except; error_msg is undefined otherwise


filename = '/filepath'
gcs_file = gcs.open(filename, 'w', content_type='text/csv')

output = StringIO.StringIO()
output.write('Column headers')
output.write('\n')
for row in formatted_rows:
    output.write(str(row))

if len(output.getvalue()) > 0:
    gcs_file.write(output.getvalue())
    output.close()
    gcs_file.close()

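As an aside on the output path: `output.write(str(row))` does not produce valid CSV for list- or tuple-shaped rows, and the manual `'"%s"' % field` quoting in `datastore_map()` breaks on fields that themselves contain quotes. A minimal sketch of row serialization using the stdlib `csv` module instead (column names here are placeholders, and it is written in modern Python; on GAE's Python 2 runtime you would use `StringIO.StringIO` rather than `io.StringIO`):

```python
import csv
import io


def format_rows_csv(rows, header):
    """Serialize row tuples to CSV text with correct quoting/escaping."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
    writer.writerow(header)       # header row, quoted like the data rows
    for row in rows:
        writer.writerow(row)      # commas and embedded quotes are escaped
    return buf.getvalue()


# Embedded commas and quotes survive round-tripping, unlike '"%s"' % field:
text = format_rows_csv([("a,b", 'say "hi"')], ("col1", "col2"))
```

The resulting string can be written to the `gcs_file` exactly as in the snippet above.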
datastore_map()

def datastore_map(entity_type):
    try:
        data = entity_type.to_dict()
    except ValueError, e:
        error_msg = "Problem loading entity to dict: {e}".format(e=repr(e))
        logging.error(error_msg)
        yield ''

    try:
        value6 = OtherNDBTable.query(OtherNDBTable.value == data.value)
    except AttributeError, e:
        warn_msg = "Could not get the value for row {row_dict}\n{msg}".format(row_dict=data, msg=repr(e))
        value6 = ""
        logging.warning(warn_msg)

    try:
        result_list = [
            data.get('value1'),
            data.get('value2'),
            data.get('value3'),
            data.get('value4'),
            data.get('value5'),
            value6
        ]
    except Exception, e:
        logging.warning("Other Exception: {} for \n {}".format(repr(e), data))
        yield ''
    result = ','.join(['"%s"' % field for field in result_list])
    yield "%s\n" % result
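Two details in `datastore_map()` above are worth flagging: the `OtherNDBTable` query is constructed but never executed (there is no `.get()` or `.fetch()`, so `value6` would hold a Query object rather than a result), and the error branches `yield ''` and then fall through, so a failing entity can yield more than once. A sketch of the lookup-plus-row logic as a plain, testable function, where the `lookup_value6` callable and the field names are stand-ins for the question's ndb query:

```python
def map_row(data, lookup_value6):
    """Build one CSV-ready line from an entity dict.

    lookup_value6 stands in for the OtherNDBTable query; in the real
    mapper it would be something like
    OtherNDBTable.query(OtherNDBTable.value == data.get('value')).get()
    -- note the trailing .get(), which actually executes the query.
    """
    try:
        value6 = lookup_value6(data.get('value'))
    except AttributeError:
        value6 = ""
    fields = [
        data.get('value1'),
        data.get('value2'),
        data.get('value3'),
        data.get('value4'),
        data.get('value5'),
        value6,
    ]
    # Return exactly one line per entity; a generator-style mapper should
    # likewise yield once per entity and return on its error paths.
    return ",".join('"%s"' % f for f in fields) + "\n"
```

This keeps the row-building logic independent of ndb, so it can be unit-tested with a fake lookup before being wired back into the mapper.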

MapReduce() Pipeline

class DatastoreMapperPipeline(base_handler.PipelineBase):
    def run(self, entity_type):
        outputs = yield mapreduce_pipeline.MapperPipeline(
            "Datastore Mapper %s" % entity_type,
            "main.datastore_map",
            "mapreduce.input_readers.DatastoreInputReader",
            output_writer_spec="mapreduce.output_writers.FileOutputWriter",
            params={
                "input_reader": {
                    "entity_kind": entity_type,
                },
                "output_writer": {
                    "filesystem": "gs",
                    "gs_bucket_name": GS_BUCKET,
                    "output_sharding": "none",
                }
            },
            shards=X) #X has been 10, 36, and 500 with no difference
        yield CloudStorageToBigQuery(outputs) # Doesn't get here

    def finalized(self):
        logging.debug("Pipeline {} has finished with outputs {}".format(self.pipeline_id, self.outputs))

The App Engine logs show only the URL that kicks off the job returning a 200; nothing else appears in any log. The MapReduce() dashboard shows the job running along with all of its shards. Each shard's last work item is "Unknown", and its elapsed time reads only a few seconds even though the job as a whole has been running for hours. Let me know if you need anything else to answer this. Thanks in advance for your help.

0 Answers:

There are no answers yet.