我有一个从API中提取数据的脚本,其中requests.get(url=url, auth=(user, password)).json()
的最终输出是“all_results”。输出为~25K行,但它包含嵌套字段。
API用于投资组合数据,children
字段是一个包含股票代码级别信息的字典(因此可能非常大)。
下面的脚本展平“all_results”并仅指定我需要的列:
final_df = pd.DataFrame()
for record in all_results:
df = pd.DataFrame(record.get('children', {}))
df['contactId'] = record.get('contactId')
df['origin'] = record.get('origin')
df['description'] = record.get('description')
final_df = final_df.append(df)
它与较小的样本完美配合,但是当试图在整个数据集上运行时,它需要HOURS。谁能提出比我目前的脚本更有效的东西?需要它比目前更快地运行。
提前谢谢!
- 完整脚本 -
user = ''
password= ""
# Starting values
start = 0
rows = 1500
base_url = 'https://....?start={0}&rows={1}'
print ("Connecting to API..")
url = base_url.format(start,rows)
req = requests.get(url=url, auth=(user, password))
print ("Extracting data..")
out = req.json()
total_records = out['other']['numFound']
print("Total records found: "+ str(total_records))
results = out['resultList']
all_results = results
print ("First " + str(rows) + " rows were extracted")
# Results will be an empty list if no more results are found
while results:
start += rows # Rebuild url based on current start
url = base_url.format(start, rows)
req = requests.get(url=url, auth=(user, password))
out = req.json()
results = out['resultList']
all_results += results
print ("Next " + str(rows) + " rows were extracted")
# All results will now contains all the responses of each request.
print("Total records returned from API: "+ str(len(all_results))) #should equal number of records in response
final_df = pd.DataFrame()
for record in all_results:
df = pd.DataFrame(record.get('children', {}))
df['contactId'] = record.get('contactId')
df['origin'] = record.get('origin')
df['description'] = record.get('description')
final_df = final_df.append(df)
final_df = final_df.reset_index()
del final_df['index']
final_df['ticker'] = final_df['identifier'].str.split('@').str.get(0) #extract ticker (anything before @)
final_df.drop_duplicates(keep='first') #removes duplicates
print('DataFrame from API created succesfully\n')
print(final_df.head(n=50))