Question

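I am using Pandas to fetch about 2 million records from an API that returns JSON objects. The API can return at most 5000 JSON objects per call, so I loop over API calls to get the JSON. These are the steps I follow:
1. Get all record_ids in a list.
2. Create the API calls (URLs) by splitting the record_ids into chunks of 5000.
3. Iterate over the created URLs to fetch the JSON.
4. Build a list of the JSON fetched above.
5. Create a dataframe with pd.io.json.json_normalize.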

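The problem is that I run out of memory once the number of records I pull exceeds a certain limit. I am trying to use Dask to get around the memory issue. However, I cannot figure out how to perform list-like operations (such as append) with a Dask bag. In other words, how do I add the JSON returned by each additional API call to the same Dask bag?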

Here is the code I am using; it works fine for smaller datasets:

import pandas as pd
import json
import requests
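import getpass

# Prompt for the credentials used for HTTP basic auth
username = input('Username: ')
password = getpass.getpass('Password: ')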

# Specify the date range and system for which the recordIDs need to be fetched
recordIDsURL = 'http://example.com:8071/records/getIds?system=ABC&daterange=2019-01-15,2019-10-15'

# Specify the record service API which returns the record info for provided record ids
recordServiceURL = 'http://example:8071/records/'

# Get the recordIds for the provided date range and system
request = requests.get(recordIDsURL, auth = requests.auth.HTTPBasicAuth(username, password))

# Put the recordIds into a list
listid = request.json()

# Divide the recordIDs into smaller lists containing 5000 recordIDs 
listChunks = [listid[x:x+5000] for x in range(0, len(listid), 5000)]

# Make a list for distinct URLs for calling the API
url = [0 for i in range(len(listChunks))]

# Make a list for storing the result of the URL calls
recordRequest = [0 for i in range(len(listChunks))]

# Make a list for converting the result of the URL calls into a list of JSONs
jsonList = [0 for i in range(len(listChunks))]

# Iterate over the URL calls 
for i in range(len(listChunks)):
    url[i] = recordServiceURL + (','.join(listChunks[i]))
    recordRequest[i] = requests.get(url[i], auth = requests.auth.HTTPBasicAuth(username, password))
    jsonList[i] = recordRequest[i].json()

# Merge the JSON list into a single JSON to load into DF
mergeJson = []
for i in jsonList:
    mergeJson += i

df = pd.io.json.json_normalize(mergeJson)

In short, I would like to use a Dask bag and a Dask dataframe in place of the Python lists and the Pandas dataframe in the code above.

Answer 1

You can use the dask.bag.concat function to concatenate many Dask bags into one.
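As a rough illustration of how this could apply to the loop in the question, here is a minimal sketch. It assumes one bag is built per chunk of ids and that the bags are then joined with dask.bag.concat. The helpers `fetch_chunk` and `chunked`, the sample ids, and the chunk size of 3 are hypothetical stand-ins so the example runs without the real API; in practice `fetch_chunk` would wrap the `requests.get(...).json()` call from the question.

```python
import dask
import dask.bag as db


def fetch_chunk(id_chunk):
    """Hypothetical stand-in for one API call; returns a list of JSON-like dicts."""
    # With the real service this might be something like:
    # requests.get(recordServiceURL + ','.join(id_chunk), auth=...).json()
    return [{"id": record_id, "system": "ABC"} for record_id in id_chunk]


def chunked(ids, size):
    """Split a list of ids into consecutive chunks of at most `size` items."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


# Made-up ids; a chunk size of 3 stands in for the API limit of 5000
record_ids = [str(n) for n in range(10)]
chunks = chunked(record_ids, 3)

# One single-partition bag per chunk; the delayed call is lazy,
# so nothing is fetched while the bags are being built
bags = [db.from_delayed([dask.delayed(fetch_chunk)(chunk)]) for chunk in chunks]

# The "append" step: concatenate all per-chunk bags into one bag
records = db.concat(bags)

# The synchronous scheduler is used here only to keep the demo simple
result = records.compute(scheduler="synchronous")
print(len(result), records.npartitions)
```

If the records are flat dictionaries, `records.to_dataframe()` should give a Dask dataframe without first collecting everything into one Python list; nested JSON would probably need to be flattened with a `map` step beforehand.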
