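Background: I have a data frame containing people's names and addresses. I am trying to catalog the people associated with each person in my data frame, so I run every row/record through an external API that returns a list of people associated with that individual. The idea is to write a series of functions that call the API, return the list of relatives, and append each name in the list to a separate column of the original data frame. The code will eventually be parallelized.
The data frame: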
import pandas as pd

# Column names match the keys read in API_call and the desired output below
df = pd.DataFrame({
    'First_Name': ['Kyle', 'Ted', 'Mary', 'Ron'],
    'Last_Name': ['Smith', 'Jones', 'Johnson', 'Reagan'],
    'Street_Address': ['123 Main Street', '456 Maple Street', '987 Tudor Place', '1600 Pennsylvania Avenue']},
    columns=['First_Name', 'Last_Name', 'Street_Address'])
The first function, which calls the API and returns a list of names:
import requests
import json
import numpy as np
from multiprocessing import Pool

def API_call(row):
    api_key = '123samplekey'
    # Let requests build and URL-encode the query string
    params = {
        'first_name': str(row['First_Name']),
        'last_name': str(row['Last_Name']),
        'address': str(row['Street_Address']),
        'api_key': api_key,
    }
    response = requests.get('https://apiaddress.com/', params=params)
    JSON = response.json()
    # Collect the name of every person in the response
    name_list = []
    for person in JSON['people']:
        name_list.append(person.get('name'))
    return name_list
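This function works well. For each person in the data frame, it returns a list of family members/friends. So for Kyle Smith the function returns [Heather Smith, Dan Smith], for Ted Jones it returns [Al Jones, Karen Jones, Tiffany Jones, Natalie Jones], and so on for every row/record in the data frame.
Problem: I am struggling to write a follow-up function that iterates over the returned list and appends each value to its own column, in the row of the person who was searched for. I would like the function to return a data frame that looks like this: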
First_Name | Last_Name | Street_Address | relative1_name | relative2_name | relative3_name | relative4_name
-----------------------------------------------------------------------------------------------------------------------------
Kyle | Smith | 123 Main Street | Heather Smith | Dan Smith | |
Ted | Jones | 456 Maple Street | Al Jones | Karen Jones | Tiffany Jones | Natalie Jones
Mary | Johnson | 987 Tudor Place | Kevin Johnson | | |
Ron | Reagan | 1600 Pennsylvania Avenue | Nancy Reagan | Patti Davis | Michael Reagan | Christine Reagan
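Note: the goal is to vectorize everything so that I can use the apply method and eventually run the whole thing in parallel. Something along the lines of the following code has worked for me in the past, when the "API_call" function returned a single object rather than a list that needs to be iterated over/mapped: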
def API_call(row):
    # all API parameters
    url = 'https://api.com/parameters'
    response = requests.get(url)
    JSON = response.json()
    single_object = JSON['key1']['key2'].get('key3')
    return single_object

def second_function(data):
    # Call the API once per row and store the result in a new column
    data['single_object'] = data.apply(API_call, axis=1)
    return data

def parallelize(dataframe, function):
    # Split the frame into chunks, process them in 4 worker processes, then recombine
    df_splits = np.array_split(dataframe, 10)
    pool = Pool(4)
    df_whole = pd.concat(pool.map(function, df_splits))
    pool.close()
    pool.join()
    return df_whole

parallelize(df, second_function)
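The problem is that I cannot write a vectorizable function (second_function) that maps the names in the list returned by the API to unique columns in the original data frame. Thanks in advance for your help!
Answer 0 (score: 0)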
import pandas as pd

def make_relatives_frame(relatives):
    # One-row frame with columns relative1_name, relative2_name, ...
    return pd.DataFrame(data=[relatives],
                        columns=["relative%i_name" % x for x in range(1, len(relatives) + 1)])

# example output from an API call
df_names = pd.DataFrame(data=[["Kyle", "Smith"]], columns=["First_Name", "Last_Name"])
relatives = ["Heather Smith", "Dan Smith"]
df_relatives = make_relatives_frame(relatives)
df_names[df_relatives.columns] = df_relatives

# example output from another API call with more relatives
df_names2 = pd.DataFrame(data=[["John", "Smith"]], columns=["First_Name", "Last_Name"])
relatives2 = ["Heath Smith", "Daryl Smith", "Scott Smith"]
df_relatives2 = make_relatives_frame(relatives2)
df_names2[df_relatives2.columns] = df_relatives2

# example of stacking the outputs (DataFrame.append was removed in pandas 2.0)
total_df = pd.concat([df_names, df_names2], ignore_index=True)
print(total_df)
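The code above should help you get started. It is obviously only a representative example, but you should be able to refactor it to fit your specific use case.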
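To tie this back to the second_function from the question, here is a minimal sketch of one way to expand the returned lists into columns for a whole data frame (or chunk) at once. The names fake_api_call and add_relatives are hypothetical and not part of the original post; fake_api_call is a stand-in that returns canned lists so the example runs without network access, and you would pass your real API_call instead.

```python
import pandas as pd


def fake_api_call(row):
    """Hypothetical stand-in for API_call: returns canned relatives, no network."""
    canned = {
        "Kyle": ["Heather Smith", "Dan Smith"],
        "Ted": ["Al Jones", "Karen Jones", "Tiffany Jones", "Natalie Jones"],
        "Mary": ["Kevin Johnson"],
    }
    return canned.get(row["First_Name"], [])


def add_relatives(data, api_call=fake_api_call):
    """Return a copy of `data` with relative1_name, relative2_name, ... appended."""
    # One list of names per row, as a Series aligned with data.index
    name_lists = data.apply(api_call, axis=1)
    # One row per person; shorter lists are padded with missing values
    relatives = pd.DataFrame(name_lists.tolist(), index=data.index)
    relatives.columns = ["relative%i_name" % (i + 1) for i in range(relatives.shape[1])]
    return pd.concat([data, relatives], axis=1)


df = pd.DataFrame({
    "First_Name": ["Kyle", "Ted", "Mary"],
    "Last_Name": ["Smith", "Jones", "Johnson"],
    "Street_Address": ["123 Main Street", "456 Maple Street", "987 Tudor Place"],
})
result = add_relatives(df)
print(result)
```

If you run this per chunk in parallel, different chunks may come back with a different number of relative columns. pd.concat aligns the chunks by column name when stacking rows, so the missing cells are simply left empty. Note that this sketch does not handle empty chunks, which np.array_split can produce when there are fewer rows than splits.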