The story
I maintain a GAE webapp that exports data. Currently it dumps an ndb table to a CSV file with a for loop. At this point the for loop takes too long (~25 minutes) and doesn't always finish before the job has to be moved to another machine. I tried to shorten the job with MapReduce(), but my MapReduce() job runs for hours with no meaningful output, errors, logs, anything. The mapper never finishes and never even tries to write to BQ. I'm sure I'm missing something. Any suggestions would help.
Code (some variable names have been changed to protect their identity)
Original For Loop
more = True
cursor = None
formatted_rows = []
query = NDBTable.query()
while more:
    rows, cursor, more = query.fetch_page(page_size=5000, start_cursor=cursor)
    try:
        formatted_rows += [datastore_map(row) for row in rows]
    except Timeout, e:
        error_msg = "{}\nThe Datastore timed out while trying to map the row export for \n{}".format(repr(e), rows)
        logging.error(error_msg)

filename = '/filepath'
gcs_file = gcs.open(filename, 'w', content_type='text/csv')
output = StringIO.StringIO()
output.write('Column headers')
output.write('\n')
for row in formatted_rows:
    output.write(str(row))
if len(output.getvalue()) > 0:
    gcs_file.write(output.getvalue())
output.close()
gcs_file.close()
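For reference, a streaming version of the same loop would look roughly like this (not what I currently run; it assumes datastore_map yields one CSV line per entity, and writes each page to GCS as it is fetched instead of buffering everything in formatted_rows):

query = NDBTable.query()
gcs_file = gcs.open('/filepath', 'w', content_type='text/csv')
gcs_file.write('Column headers\n')
more = True
cursor = None
while more:
    rows, cursor, more = query.fetch_page(page_size=5000, start_cursor=cursor)
    try:
        # Write each page as soon as it is fetched rather than holding
        # every formatted row in memory until the end.
        for row in rows:
            for line in datastore_map(row):
                gcs_file.write(line)
    except Timeout, e:
        logging.error("Datastore timed out mapping rows for export: %s", repr(e))
gcs_file.close()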
datastore_map()
def datastore_map(entity_type):
    try:
        data = entity_type.to_dict()
    except ValueError, e:
        error_msg = "Problem loading entity to dict: {e}".format(e=repr(e))
        logging.error(error_msg)
        yield ''
    try:
        value6 = OtherNDBTable.query(OtherNDBTable.value == data.value)
    except AttributeError, e:
        warn_msg = "Could not get the value for row {row_dict}\n{msg}".format(row_dict=data, msg=repr(e))
        value6 = ""
        logging.warning(warn_msg)
    try:
        result_list = [
            data.get('value1'),
            data.get('value2'),
            data.get('value3'),
            data.get('value4'),
            data.get('value5'),
            value6
        ]
    except Exception, e:
        logging.warning("Other Exception: {} for \n {}".format(repr(e), data))
        yield ''
    result = ','.join(['"%s"' % field for field in result_list])
    yield "%s\n" % result
MapReduce() Pipeline
class DatastoreMapperPipeline(base_handler.PipelineBase):

    def run(self, entity_type):
        outputs = yield mapreduce_pipeline.MapperPipeline(
            "Datastore Mapper %s" % entity_type,
            "main.datastore_map",
            "mapreduce.input_readers.DatastoreInputReader",
            output_writer_spec="mapreduce.output_writers.FileOutputWriter",
            params={
                "input_reader": {
                    "entity_kind": entity_type,
                },
                "output_writer": {
                    "filesystem": "gs",
                    "gs_bucket_name": GS_BUCKET,
                    "output_sharding": "none",
                }
            },
            shards=X)  # X has been 10, 36, and 500 with no difference
        yield CloudStorageToBigQuery(outputs)  # Doesn't get here

    def finalized(self):
        logging.debug("Pipeline {} has finished with outputs {}".format(self.pipeline_id, self.outputs))
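The pipeline is started from a request handler roughly like the following (simplified sketch; the handler name and the "main.NDBTable" entity-kind string are placeholders, not my actual app code):

import webapp2

class StartExportHandler(webapp2.RequestHandler):
    def get(self):
        # entity_kind must match whatever string the DatastoreInputReader
        # expects for the model being exported.
        pipe = DatastoreMapperPipeline("main.NDBTable")
        pipe.start()
        # The pipeline library ships a status UI that shows each shard's progress.
        self.redirect(pipe.base_path + "/status?root=" + pipe.pipeline_id)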
Logs
The App Engine logs show only the URL that kicks off the job, returning a 200. Nothing else appears in any of the logs. The MapReduce() dashboard shows the job running and all the shards running it. Each shard's last work item is Unknown, and its time shows only a few seconds even though the total running time has been hours. Let me know if you need anything else to help answer this. Thanks in advance for your help.