I have a GAE datastore kind with several 100'000 objects in it. I want to do several involved queries (including counting queries), and BigQuery seems a good fit for this.
Is there currently an easy way to query a live AppEngine Datastore using BigQuery?
Answer 0 (Score: 17)
You can't run BigQuery directly on Datastore entities, but you can write a Mapper pipeline that reads entities out of the Datastore, writes them to CSV in Google Cloud Storage, and then ingests those files into BigQuery; you can even automate the whole process. Here's an example that uses the Mapper API classes for just the Datastore-to-CSV step:
import logging
from datetime import datetime

import httplib2

from google.appengine.ext import db
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app

from mapreduce import base_handler
from mapreduce import mapreduce_pipeline

from apiclient.discovery import build
from oauth2client.appengine import AppAssertionCredentials

# Number of shards to use in the Mapper pipeline
SHARDS = 20

# Name of the project's Google Cloud Storage bucket
GS_BUCKET = 'your bucket'

# Datastore model
class YourEntity(db.Expando):
    field1 = db.StringProperty()  # etc, etc

ENTITY_KIND = 'main.YourEntity'

class MapReduceStart(webapp.RequestHandler):
    """Handler that provides a link for the user to start the MapReduce pipeline."""
    def get(self):
        pipeline = IteratorPipeline(ENTITY_KIND)
        pipeline.start()
        path = pipeline.base_path + "/status?root=" + pipeline.pipeline_id
        logging.info('Redirecting to: %s' % path)
        self.redirect(path)

class IteratorPipeline(base_handler.PipelineBase):
    """A pipeline that iterates through the datastore."""
    def run(self, entity_type):
        output = yield mapreduce_pipeline.MapperPipeline(
            "DataStore_to_Google_Storage_Pipeline",
            "main.datastore_map",
            "mapreduce.input_readers.DatastoreInputReader",
            output_writer_spec="mapreduce.output_writers.FileOutputWriter",
            params={
                "input_reader": {
                    "entity_kind": entity_type,
                },
                "output_writer": {
                    "filesystem": "gs",
                    "gs_bucket_name": GS_BUCKET,
                    "output_sharding": "none",
                }
            },
            shards=SHARDS)

def datastore_map(entity):
    # Called once per entity; emits one CSV row per entity.
    # Sort the property names so every shard writes columns in the same order.
    props = sorted(GetPropsFor(entity))
    data = db.to_dict(entity)
    result = ','.join(['"%s"' % str(data.get(k)) for k in props])
    yield('%s\n' % result)

def GetPropsFor(entity_or_kind):
    if isinstance(entity_or_kind, basestring):
        kind = entity_or_kind
    else:
        kind = entity_or_kind.kind()
    cls = globals().get(kind)
    return cls.properties()

application = webapp.WSGIApplication(
    [('/start', MapReduceStart)],
    debug=True)

def main():
    run_wsgi_app(application)

if __name__ == "__main__":
    main()
If you append this to the end of your IteratorPipeline class: yield CloudStorageToBigQuery(output), you can pipe the resulting CSV file handles into a BigQuery ingestion pipeline, like so:
class CloudStorageToBigQuery(base_handler.PipelineBase):
    """A pipeline that kicks off a BigQuery ingestion job."""
    def run(self, output):
        # BigQuery API settings
        SCOPE = 'https://www.googleapis.com/auth/bigquery'
        PROJECT_ID = 'Some_ProjectXXXX'
        DATASET_ID = 'Some_DATASET'

        # Create a new API service for interacting with BigQuery,
        # authenticated as the app's service account.
        credentials = AppAssertionCredentials(scope=SCOPE)
        http = credentials.authorize(httplib2.Http())
        bigquery_service = build("bigquery", "v2", http=http)

        jobs = bigquery_service.jobs()
        table_name = 'datastore_dump_%s' % datetime.utcnow().strftime(
            '%m%d%Y_%H%M%S')
        # Convert the mapper's /gs/... file handles into gs:// source URIs.
        files = [str(f.replace('/gs/', 'gs://')) for f in output]
        result = jobs.insert(
            projectId=PROJECT_ID,
            body=build_job_data(PROJECT_ID, DATASET_ID,
                                table_name, files)).execute()
        logging.info(result)

def build_job_data(project_id, dataset_id, table_name, files):
    # The project and dataset IDs are passed in explicitly; they are local
    # to CloudStorageToBigQuery.run and not visible at module level.
    return {"projectId": project_id,
            "configuration": {
                "load": {
                    "sourceUris": files,
                    "schema": {
                        # put your schema here
                        "fields": fields
                    },
                    "destinationTable": {
                        "projectId": project_id,
                        "datasetId": dataset_id,
                        "tableId": table_name,
                    },
                }
            }
            }
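The fields value above is left as a placeholder for your schema. A BigQuery load-job schema is a list of field descriptors; as a minimal sketch for the hypothetical YourEntity model above, it might look like this:

# Hypothetical schema for YourEntity; match the names and types
# ("STRING", "INTEGER", "FLOAT", "BOOLEAN", "TIMESTAMP") to your own properties.
fields = [
    {"name": "field1", "type": "STRING", "mode": "NULLABLE"},
]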
Answer 1 (Score: 7)
With the new (September 2013) streaming inserts api you can import records from your app into BigQuery.
The data is available in BigQuery immediately, so this should satisfy your live requirement.
Whilst this question is now a bit old, this may be an easier solution for anyone stumbling across it.
At the moment, though, getting this to work from the local dev server is patchy at best.
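As a rough sketch of the streaming approach, using the same service-account setup as the accepted answer (the function name and table are made up, and the destination table must already exist in BigQuery):

import uuid

import httplib2
from apiclient.discovery import build
from oauth2client.appengine import AppAssertionCredentials

def stream_row_to_bigquery(project_id, dataset_id, table_id, row):
    """Stream one dict as a row into an existing BigQuery table."""
    credentials = AppAssertionCredentials(
        scope='https://www.googleapis.com/auth/bigquery')
    http = credentials.authorize(httplib2.Http())
    bigquery = build("bigquery", "v2", http=http)
    # insertId lets BigQuery de-duplicate the row if the call is retried.
    body = {"rows": [{"insertId": str(uuid.uuid4()), "json": row}]}
    return bigquery.tabledata().insertAll(
        projectId=project_id, datasetId=dataset_id,
        tableId=table_id, body=body).execute()

You would call this from your write path, e.g. after each put(), with the entity's properties converted to JSON-serializable values.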
Answer 2 (Score: 5)
We're doing a Trusted Tester program for moving from Datastore to BigQuery in two simple operations:
1. Back up the datastore using the Datastore Admin's backup functionality
2. Import the backup directly into BigQuery
It automatically takes care of the schema for you.
More info (to apply): https://docs.google.com/a/google.com/spreadsheet/viewform?formkey=dHdpeXlmRlZCNWlYSE9BcE5jc2NYOUE6MQ
Answer 3 (Score: 3)
For BigQuery you have to export those Kinds into a CSV or delimited-record structure, load them into BigQuery, and then you can query. There is no facility I know of for querying the live GAE Datastore.
BigQuery is an analytical query engine, which means you can't change the records. No updates or deletes are allowed; you can only append.
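Once the data is loaded, the counting queries from the original question are one API call each. A minimal sketch in the same apiclient style as the accepted answer (the dataset and table names are placeholders):

import httplib2
from apiclient.discovery import build
from oauth2client.appengine import AppAssertionCredentials

def count_rows(project_id, dataset_id, table_id):
    """Run a synchronous SELECT COUNT(*) against a BigQuery table."""
    credentials = AppAssertionCredentials(
        scope='https://www.googleapis.com/auth/bigquery')
    http = credentials.authorize(httplib2.Http())
    bigquery = build("bigquery", "v2", http=http)
    result = bigquery.jobs().query(
        projectId=project_id,
        body={"query": "SELECT COUNT(*) FROM [%s.%s]" % (dataset_id, table_id)}
    ).execute()
    return int(result["rows"][0]["f"][0]["v"])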
Answer 4 (Score: 2)
No. BigQuery is a different product that needs the data to be uploaded to it; it cannot work over the Datastore. You can use GQL to query the Datastore.
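For comparison, a minimal GQL sketch (YourEntity and field1 are the placeholder names from the accepted answer); note that counting this way walks the Datastore keys one by one, so it gets slow well before 100'000 entities, which is exactly why BigQuery is attractive here:

from google.appengine.ext import db

# Fetch matching entities.
query = db.GqlQuery("SELECT * FROM YourEntity WHERE field1 = :1", "some value")
entities = query.fetch(limit=100)

# Counting is possible but expensive: the Datastore scans every matching key.
total = query.count(limit=100000)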
Answer 5 (Score: 1)
As of 2016, this is very possible now! You must do the following:
1. Make a new bucket in Google Cloud Storage
2. Back up your entities using the Datastore Admin at console.developers.google.com
3. Head to the BigQuery web UI and import the generated backup files
See this post for a complete example of this workflow!