I have a GAE datastore kind with several 100'000 objects in it. I want to do several involved queries (including counting queries), and BigQuery seems a good fit for this.
Is there currently an easy way to query a live AppEngine Datastore using BigQuery?
Answer 0 (Score: 17)
You can't run BigQuery directly on Datastore entities, but you can write a Mapper pipeline that reads entities out of the Datastore, writes them to CSV in Google Cloud Storage, and then ingests those files into BigQuery; you can even automate the whole process. Here's an example that uses the Mapper API classes for just the Datastore-to-CSV step:
import logging
from datetime import datetime

import httplib2

from google.appengine.ext import db
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app

from mapreduce import base_handler
from mapreduce import mapreduce_pipeline

from apiclient.discovery import build
from oauth2client.appengine import AppAssertionCredentials

# Number of shards to use in the Mapper pipeline
SHARDS = 20

# Name of the project's Google Cloud Storage bucket
GS_BUCKET = 'your bucket'

# Datastore model
class YourEntity(db.Expando):
    field1 = db.StringProperty()  # etc, etc

ENTITY_KIND = 'main.YourEntity'

class MapReduceStart(webapp.RequestHandler):
    """Handler that provides a link for the user to start the MapReduce pipeline."""
    def get(self):
        pipeline = IteratorPipeline(ENTITY_KIND)
        pipeline.start()
        path = pipeline.base_path + "/status?root=" + pipeline.pipeline_id
        logging.info('Redirecting to: %s' % path)
        self.redirect(path)

class IteratorPipeline(base_handler.PipelineBase):
    """A pipeline that iterates through the datastore."""
    def run(self, entity_type):
        output = yield mapreduce_pipeline.MapperPipeline(
            "DataStore_to_Google_Storage_Pipeline",
            "main.datastore_map",
            "mapreduce.input_readers.DatastoreInputReader",
            output_writer_spec="mapreduce.output_writers.FileOutputWriter",
            params={
                "input_reader": {
                    "entity_kind": entity_type,
                },
                "output_writer": {
                    "filesystem": "gs",
                    "gs_bucket_name": GS_BUCKET,
                    "output_sharding": "none",
                }
            },
            shards=SHARDS)

def datastore_map(entity):
    # Called once per entity; emits one CSV row per entity.
    # Sort the property names so every shard writes columns in the same order.
    props = sorted(GetPropsFor(entity))
    data = db.to_dict(entity)
    result = ','.join(['"%s"' % str(data.get(k)) for k in props])
    yield('%s\n' % result)

def GetPropsFor(entity_or_kind):
    if isinstance(entity_or_kind, basestring):
        kind = entity_or_kind
    else:
        kind = entity_or_kind.kind()
    cls = globals().get(kind)
    return cls.properties()

application = webapp.WSGIApplication(
    [('/start', MapReduceStart)],
    debug=True)

def main():
    run_wsgi_app(application)

if __name__ == "__main__":
    main()
If you append this to the end of your IteratorPipeline class: yield CloudStorageToBigQuery(output), you can pipe the resulting CSV file handles into a BigQuery ingestion pipeline, like so:
class CloudStorageToBigQuery(base_handler.PipelineBase):
    """A pipeline that kicks off a BigQuery ingestion job."""
    def run(self, output):
        # BigQuery API settings
        SCOPE = 'https://www.googleapis.com/auth/bigquery'
        PROJECT_ID = 'Some_ProjectXXXX'
        DATASET_ID = 'Some_DATASET'

        # Create a new API service for interacting with BigQuery,
        # authenticated as the app's service account.
        credentials = AppAssertionCredentials(scope=SCOPE)
        http = credentials.authorize(httplib2.Http())
        bigquery_service = build("bigquery", "v2", http=http)

        jobs = bigquery_service.jobs()
        table_name = 'datastore_dump_%s' % datetime.utcnow().strftime(
            '%m%d%Y_%H%M%S')
        # Convert the mapper's /gs/... file handles into gs:// source URIs.
        files = [str(f.replace('/gs/', 'gs://')) for f in output]
        result = jobs.insert(
            projectId=PROJECT_ID,
            body=build_job_data(PROJECT_ID, DATASET_ID,
                                table_name, files)).execute()
        logging.info(result)

def build_job_data(project_id, dataset_id, table_name, files):
    # The project and dataset IDs are passed in explicitly; they are local
    # to CloudStorageToBigQuery.run and not visible at module level.
    return {"projectId": project_id,
            "configuration": {
                "load": {
                    "sourceUris": files,
                    "schema": {
                        # put your schema here
                        "fields": fields
                    },
                    "destinationTable": {
                        "projectId": project_id,
                        "datasetId": dataset_id,
                        "tableId": table_name,
                    },
                }
            }
            }
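The fields value above is left as a placeholder for your schema. A BigQuery load-job schema is a list of field descriptors; as a minimal sketch for the hypothetical YourEntity model above, it might look like this:

# Hypothetical schema for YourEntity; match the names and types
# ("STRING", "INTEGER", "FLOAT", "BOOLEAN", "TIMESTAMP") to your own properties.
fields = [
    {"name": "field1", "type": "STRING", "mode": "NULLABLE"},
]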
Answer 1 (Score: 7)
With the new (September 2013) streaming inserts api you can import records from your app into BigQuery.
The data is available in BigQuery immediately, so this should satisfy your live requirement.
Whilst this question is now a bit old, this may be an easier solution for anyone stumbling across it.
At the moment, though, getting this to work from the local dev server is patchy at best.
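As a rough sketch of the streaming approach, using the same service-account setup as the accepted answer (the function name and table are made up, and the destination table must already exist in BigQuery):

import uuid

import httplib2
from apiclient.discovery import build
from oauth2client.appengine import AppAssertionCredentials

def stream_row_to_bigquery(project_id, dataset_id, table_id, row):
    """Stream one dict as a row into an existing BigQuery table."""
    credentials = AppAssertionCredentials(
        scope='https://www.googleapis.com/auth/bigquery')
    http = credentials.authorize(httplib2.Http())
    bigquery = build("bigquery", "v2", http=http)
    # insertId lets BigQuery de-duplicate the row if the call is retried.
    body = {"rows": [{"insertId": str(uuid.uuid4()), "json": row}]}
    return bigquery.tabledata().insertAll(
        projectId=project_id, datasetId=dataset_id,
        tableId=table_id, body=body).execute()

You would call this from your write path, e.g. after each put(), with the entity's properties converted to JSON-serializable values.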
Answer 2 (Score: 5)
We're doing a Trusted Tester program for moving from Datastore to BigQuery in two simple operations:
1. Back up the datastore using the Datastore Admin's backup functionality
2. Import the backup directly into BigQuery
It automatically takes care of the schema for you.
More info (to apply): https://docs.google.com/a/google.com/spreadsheet/viewform?formkey=dHdpeXlmRlZCNWlYSE9BcE5jc2NYOUE6MQ
Answer 3 (Score: 3)
For BigQuery you have to export those Kinds into a CSV or delimited-record structure, load them into BigQuery, and then you can query. There is no facility I know of for querying the live GAE Datastore.
BigQuery is an analytical query engine, which means you can't change the records. No updates or deletes are allowed; you can only append.
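Once the data is loaded, the counting queries from the original question are one API call each. A minimal sketch in the same apiclient style as the accepted answer (the dataset and table names are placeholders):

import httplib2
from apiclient.discovery import build
from oauth2client.appengine import AppAssertionCredentials

def count_rows(project_id, dataset_id, table_id):
    """Run a synchronous SELECT COUNT(*) against a BigQuery table."""
    credentials = AppAssertionCredentials(
        scope='https://www.googleapis.com/auth/bigquery')
    http = credentials.authorize(httplib2.Http())
    bigquery = build("bigquery", "v2", http=http)
    result = bigquery.jobs().query(
        projectId=project_id,
        body={"query": "SELECT COUNT(*) FROM [%s.%s]" % (dataset_id, table_id)}
    ).execute()
    return int(result["rows"][0]["f"][0]["v"])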
Answer 4 (Score: 2)
No. BigQuery is a different product that needs the data to be uploaded to it; it cannot work over the Datastore. You can use GQL to query the Datastore.
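For comparison, a minimal GQL sketch (YourEntity and field1 are the placeholder names from the accepted answer); note that counting this way walks the Datastore keys one by one, so it gets slow well before 100'000 entities, which is exactly why BigQuery is attractive here:

from google.appengine.ext import db

# Fetch matching entities.
query = db.GqlQuery("SELECT * FROM YourEntity WHERE field1 = :1", "some value")
entities = query.fetch(limit=100)

# Counting is possible but expensive: the Datastore scans every matching key.
total = query.count(limit=100000)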
Answer 5 (Score: 1)
As of 2016, this is very possible now! You must do the following:
1. Make a new bucket in Google Cloud Storage
2. Back up your entities using the Datastore Admin at console.developers.google.com
3. Head to the BigQuery web UI and import the generated backup files
See this post for a complete example of this workflow!