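(I have added the google-analytics-api tag, but I suspect my problem is more of a basic flaw in my looping approach, as described below.)
I am using Python to query the Google Analytics API (V4). I have successfully connected to the API with my credentials, and I am now trying to loop over each 10k result set the API returns in order to get the full result set.
When querying the API, you pass a dict that looks like this: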
{'reportRequests':[{'viewId': '1234567', # my actual view id goes here of course
'pageToken': 'go', # can be any string initially (I think?)
'pageSize': 10000,
'samplingLevel': 'LARGE',
'dateRanges': [{'startDate': '2018-06-01', 'endDate': '2018-07-13'}],
'dimensions': [{'name': 'ga:date'}, {'name': 'ga:dimension1'}, {'name': 'ga:dimension2'}, {'name': 'ga:userType'}, {'name': 'ga:landingpagePath'}, {'name': 'ga:deviceCategory'}],
'metrics': [{'expression': 'ga:sessions'}, {'expression': 'ga:bounces'}, {'expression': 'ga:goal1Completions'}]}]}
According to the documentation on Google Analytics API V4, regarding the pageToken parameter:
"A continuation token to get the next page of the results. Adding this to the request will return the rows after the pageToken. The pageToken should be the value returned in the nextPageToken parameter in the response to the reports.batchGet request. "
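My understanding is that I need to query the API in chunks of 10,000 rows (the maximum allowed page size), and that the value of the nextPageToken field returned by each query must be passed into the next query.
From my research, it sounds like the nextPageToken field will be an empty string once all results have been returned.
So I tried a while loop. To get to the looping stage, I built a few functions: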
## generates the dimensions in the right format to use in the query
def generate_dims(dims):
    dims_ar = []
    for i in dims:
        d = {'name': i}
        dims_ar.append(d)
    return(dims_ar)

## generates the metrics in the right format to use in the query
def generate_metrics(mets):
    mets_ar = []
    for i in mets:
        m = {'expression': i}
        mets_ar.append(m)
    return(mets_ar)

## generate the query dict
def query(pToken, dimensions, metrics, start, end):
    api_query = {
        'reportRequests': [
            {'viewId': VIEW_ID,
             'pageToken': pToken,
             'pageSize': 10000,
             'samplingLevel': 'LARGE',
             'dateRanges': [{'startDate': start, 'endDate': end}],
             'dimensions': generate_dims(dimensions),
             'metrics': generate_metrics(metrics)
            }]
    }
    return(api_query)
Example use of the three functions above:
sessions1_qr = query(pToken = pageToken,
dimensions = ['ga:date', 'ga:dimension1', 'ga:dimension2',
'ga:userType', 'ga:landingpagePath',
'ga:deviceCategory'],
metrics = ['ga:sessions', 'ga:bounces', 'ga:goal1Completions'],
start = '2018-06-01',
end = '2018-07-13')
The result looks like the first code block in this post.
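The request body can be checked locally, without calling the API. The following is a minimal sketch that restates the three helpers in compact form (list comprehensions instead of append loops) so that it runs on its own; the VIEW_ID value is the placeholder from the first code block, and the shortened dimension and metric lists are only for illustration.

```python
VIEW_ID = '1234567'  # placeholder view id, as in the first code block


def generate_dims(dims):
    # One {'name': ...} dict per dimension
    return [{'name': d} for d in dims]


def generate_metrics(mets):
    # One {'expression': ...} dict per metric
    return [{'expression': m} for m in mets]


def query(pToken, dimensions, metrics, start, end):
    # Same structure as the query() function above
    return {
        'reportRequests': [{
            'viewId': VIEW_ID,
            'pageToken': pToken,
            'pageSize': 10000,
            'samplingLevel': 'LARGE',
            'dateRanges': [{'startDate': start, 'endDate': end}],
            'dimensions': generate_dims(dimensions),
            'metrics': generate_metrics(metrics),
        }]
    }


request_body = query('go', ['ga:date', 'ga:userType'], ['ga:sessions'],
                     '2018-06-01', '2018-07-13')
report_request = request_body['reportRequests'][0]
print(report_request['dimensions'])
print(report_request['metrics'])
```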
So far, so good. Here is the loop I tried:
def main(query):
    global pageToken, store_response
    # debugging, was hoping to see print output on each iteration (I didn't)
    print(pageToken)
    while pageToken != "":
        analytics = initialize_analyticsreporting()
        response = get_report(analytics, query)
        pageToken = response['reports'][0]['nextPageToken'] # < IT ALL COMES DOWN TO THIS LINE HERE
        store_response['pageToken'] = response
    return(False) # don't actually need the function to return anything, just append to global store_response.
Then I tried to run it:
pageToken = "go" # can be any string to get started
store_response = {}
sessions1 = main(sessions1_qr)
Here is what happens: the print(pageToken) line prints to the console once, showing the initial value of pageToken. So it looks like my loop only runs once.
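Staring at the code, I suspect it has to do with the value of the query argument I pass to main(). When I first call main(), the value of query is the same as the first code block above (the variable sessions1_qr, a dict with all of the API call parameters). On each loop iteration this should be updated so that the value of pageToken is replaced with the nextPageToken value from the response.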
In short, I need to update the input of the loop with the result of the previous iteration. My logic is clearly flawed, so any help is much appreciated.
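The suspicion about the query argument can be reproduced without touching the API. The sketch below uses a hypothetical, trimmed-down build_query() in place of query(); it shows that the dict keeps whatever token it was built with, so assigning a new value to the variable afterwards does not change the request.

```python
def build_query(page_token):
    # Hypothetical stand-in for query(), reduced to the fields that matter here
    return {'reportRequests': [{'pageToken': page_token, 'pageSize': 10000}]}


page_token = 'go'
request_body = build_query(page_token)

# Rebinding the variable does not touch the dict that was already built
page_token = '10000'
stale_token = request_body['reportRequests'][0]['pageToken']

# Writing the new token into the dict is what actually changes the request
request_body['reportRequests'][0]['pageToken'] = page_token
fresh_token = request_body['reportRequests'][0]['pageToken']

print(stale_token, fresh_token)
```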
Answer 0 (score: 1)
This is how I would approach it:
def main(query):
    global pageToken, store_response
    while pageToken != "":
        # debugging, was hoping to see print output on each iteration (I didn't)
        print(pageToken)
        analytics = initialize_analyticsreporting()
        response = get_report(analytics, query)
        # note that this has changed -- you were using 'pageToken' as a key
        # which would overwrite each response
        store_response[pageToken] = response
        pageToken = response['reports'][0]['nextPageToken'] # update the pageToken
        query['reportRequests'][0]['pageToken'] = pageToken # update the query
    return(False) # don't actually need the function to return anything, just append to global store_response.
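That is, update the query data structure by hand, and store each response using the pageToken value as the dictionary key.
Presumably the last page comes back with '' as its nextPageToken, so the loop will stop.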
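If the last page turns out to have no nextPageToken key at all, rather than an empty string, indexing it directly would raise a KeyError. Here is a rough sketch of the same loop without globals that tolerates both cases. The names fetch_all_pages, fetch_page and fake_fetch are hypothetical: fetch_page stands for any callable that takes the request dict and returns the response dict (for example a small wrapper around get_report), and fake_fetch is a stub so the loop can be run offline. Dropping the initial 'go' token from the first request is also an assumption about what the API expects.

```python
import copy


def fetch_all_pages(fetch_page, request_body):
    """Collect every page of a paginated report into a list of responses."""
    body = copy.deepcopy(request_body)  # leave the caller's dict untouched
    # Assumption: the first request is sent without a pageToken
    body['reportRequests'][0].pop('pageToken', None)
    responses = []
    while True:
        response = fetch_page(body)
        responses.append(response)
        # Assumption: the key is missing (or empty) on the last page
        next_token = response['reports'][0].get('nextPageToken')
        if not next_token:
            break
        # Feed the token from this response into the next request
        body['reportRequests'][0]['pageToken'] = next_token
    return responses


def fake_fetch(body):
    # Stub standing in for the API: three pages, linked by tokens '1' and '2'
    token = body['reportRequests'][0].get('pageToken')
    pages = {
        None: {'reports': [{'rows': ['a'], 'nextPageToken': '1'}]},
        '1': {'reports': [{'rows': ['b'], 'nextPageToken': '2'}]},
        '2': {'reports': [{'rows': ['c']}]},
    }
    return pages[token]


pages = fetch_all_pages(
    fake_fetch,
    {'reportRequests': [{'pageSize': 10000, 'pageToken': 'go'}]},
)
print(len(pages))
```

Returning a list keeps the pages in request order and avoids keying the results by token value.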