I am pulling data from the Microsoft Academic Knowledge API and using the JSON response as a dictionary to extract the information I need. As I do this, I append the information to a NumPy array, and at the end I convert it to a pandas DataFrame for export. The program runs correctly, but it takes a very long time, and it seems to slow down as it runs: the first few iterations of the loop take only a few seconds, but later ones take minutes.

I have simplified the if/else statements as much as I can, which helped a little but not enough to matter. I have also reduced the number of API queries as much as possible, but each query can return at most 1000 results and I need about 35,000.
import numpy as np
import requests as req

rel_info = np.array([("Title", "Author_Name", "Jornal_Published_In", "Date")])

for l in range(0, loops):  # loops is defined above to be 35
    offset = 1000 * l
    # keep track of progress
    print("Progress:" + str(round((offset/total_res)*100, 2)) + "%")
    # get data with request to MAK. 1000 is the max count
    url = "https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate?expr=And(Composite(AA.AfN=='brigham young university'),Y>=1908)&model=latest&count=1000&offset="+str(offset)+"&attributes=Ti,D,AA.DAfN,AA.DAuN,J.JN"
    response = req.get(url + '&subscription-key={key}')
    data = response.json()

    for i in range(0, len(data["entities"])):
        new_data = data["entities"][i]
        # get new data
        new_title = new_data["Ti"]  # get title
        if 'J' not in new_data:  # get journal; handle the case where the key is missing
            new_journ = ""
        else:
            new_journ = new_data["J"]["JN"] or ""
        new_date = new_data["D"]  # get date
        new_auth = ""  # get only authors affiliated with BYU; handle the case where the key is missing
        for j in range(0, len(new_data["AA"])):
            if 'DAfN' not in new_data["AA"][j]:
                new_auth = new_auth + ""
            else:
                if new_data["AA"][j]["DAfN"] == "Brigham Young University" and new_auth == "":  # possibly combine conditionals to make this less complex
                    new_auth = new_data["AA"][j]["DAuN"]
                elif new_data["AA"][j]["DAfN"] == "Brigham Young University" and new_auth != "":
                    new_auth = new_auth + ", " + new_data["AA"][j]["DAuN"]
        # keep adding new data to the whole array
        new_info = np.array([(new_title, new_auth, new_journ, new_date)])
        rel_info = np.vstack((rel_info, new_info))
Answer 0 (Score: 0)

Try using concurrent.futures to fetch the results in a pool of worker threads, like this:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor() as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
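Applied to the question, the same pattern could look roughly like the sketch below. It is only a sketch: the BASE_URL and fetch_page names are hypothetical, loops = 35 and the {key} placeholder are taken from the original script, and the per-entity parsing loop is unchanged and left out here.

import concurrent.futures
import requests as req

loops = 35  # as in the original script
BASE_URL = ("https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate"
            "?expr=And(Composite(AA.AfN=='brigham young university'),Y>=1908)"
            "&model=latest&count=1000&attributes=Ti,D,AA.DAfN,AA.DAuN,J.JN")

# Hypothetical helper: fetch one 1000-entity page of results for a given offset.
def fetch_page(offset):
    url = BASE_URL + "&offset=" + str(offset) + "&subscription-key={key}"  # {key} is a placeholder for the real key
    return req.get(url).json()

# Issue the 35 requests concurrently instead of one after another.
with concurrent.futures.ThreadPoolExecutor() as executor:
    pages = list(executor.map(fetch_page, [1000 * l for l in range(loops)]))

# pages now holds one decoded JSON response per offset, in request order,
# and can be fed through the same entity-extraction loop as before.

Note that this only overlaps the network waits; it does not change the cost of growing the NumPy array row by row, which Answer 1 below addresses.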
Answer 1 (Score: 0)

In the end, I solved this by changing how I append to the large array of collected data. Instead of adding one row per iteration, I build a temporary array that holds the 1000 rows from each response and then append that temporary array to the full data set. This reduced the run time to about one minute, compared to 43 minutes before.
import numpy as np
import requests as req

rel_info = np.array([("Title", "Author_Name", "Jornal_Published_In", "Date")])

for req_num in range(0, loops):
    offset = 1000 * req_num
    # keep track of progress
    print("Progress:" + str(round((offset/total_res)*100, 2)) + "%")
    # get data with request to MAK. 1000 is the max count
    url = "https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate?expr=And(Composite(AA.AfN=='brigham young university'),Y>=1908)&model=latest&count=1000&offset="+str(offset)+"&attributes=Ti,D,AA.DAfN,AA.DAuN,J.JN"
    response = req.get(url + '&subscription-key={key}')
    data = response.json()

    for i in range(0, len(data["entities"])):
        new_data = data["entities"][i]
        # get new data
        new_title = new_data["Ti"]  # get title
        if 'J' not in new_data:  # get journal; handle the case where the key is missing
            new_journ = ""
        else:
            new_journ = new_data["J"]["JN"] or ""
        new_date = new_data["D"]  # get date
        new_auth = ""  # get only authors affiliated with BYU; handle the case where the key is missing
        for j in range(0, len(new_data["AA"])):
            if 'DAfN' not in new_data["AA"][j]:
                new_auth = new_auth + ""
            else:
                if new_data["AA"][j]["DAfN"] == "Brigham Young University" and new_auth == "":  # possibly combine conditionals to make this less complex
                    new_auth = new_data["AA"][j]["DAuN"]
                elif new_data["AA"][j]["DAfN"] == "Brigham Young University" and new_auth != "":
                    new_auth = new_auth + ", " + new_data["AA"][j]["DAuN"]
        # here are the changes:
        # keep adding to a temporary array for the 1000 entities in this response
        new_info = np.array([(new_title, new_auth, new_journ, new_date)])
        if i == 0:
            work_stack = new_info
        else:
            work_stack = np.vstack((work_stack, new_info))
    # add the temporary array to the whole array (this is what speeds up the program)
    rel_info = np.vstack((rel_info, work_stack))
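The same idea can be taken further by skipping NumPy during collection entirely: appending each row to a plain Python list is cheap and never copies the existing rows, and pandas can build the DataFrame in one pass at the end. A minimal sketch of that variant follows; the sample rows, the rows list, and the mak_results.csv file name are placeholders for illustration, and in the real script the tuples would come from the (new_title, new_auth, new_journ, new_date) values parsed in the loop above.

import pandas as pd

rows = []  # plain Python list; each append is O(1) amortized and copies nothing

# Placeholder rows standing in for the parsed API entities, so the sketch runs on its own.
sample_entities = [
    ("Example Title 1", "A. Author, B. Author", "Example Journal", "2015-01-01"),
    ("Example Title 2", "C. Author", "", "2016-06-30"),
]
for new_title, new_auth, new_journ, new_date in sample_entities:
    rows.append((new_title, new_auth, new_journ, new_date))

# Build the DataFrame once at the end instead of growing an array row by row.
df = pd.DataFrame(rows, columns=["Title", "Author_Name", "Jornal_Published_In", "Date"])
df.to_csv("mak_results.csv", index=False)  # hypothetical output file name

Repeated np.vstack calls copy the entire accumulated array each time, which is why the original version kept getting slower as it ran; both the batched-array fix above and this list-based variant avoid that quadratic copying.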