Map each element of a list to a pandas DataFrame

Asked: 2016-09-28 15:27:10

Tags: python list function pandas vectorization

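Background: I have a data frame containing individuals' names and addresses. I am trying to catalog the people associated with each individual in my data frame, so I run every row/record through an external API that returns a list of people associated with that individual. The idea is to write a series of functions that call the API, return the list of relatives, and append each name in the list to a separate column of the original data frame. The code will eventually be parallelized.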

The data frame:

import pandas as pd

# Column names match the keys read in API_call and the desired output table
df = pd.DataFrame({
    'First_Name': ['Kyle', 'Ted', 'Mary', 'Ron'],
    'Last_Name': ['Smith', 'Jones', 'Johnson', 'Reagan'],
    'Street_Address': ['123 Main Street', '456 Maple Street', '987 Tudor Place', '1600 Pennsylvania Avenue']},
    columns=['First_Name', 'Last_Name', 'Street_Address'])

The first function, which calls the API and returns a list of names:

import requests
import json
import numpy as np
from multiprocessing import Pool

def API_call(row):
    api_key = '123samplekey'
    # Let requests build and URL-encode the query string
    params = {
        'first_name': str(row['First_Name']),
        'last_name': str(row['Last_Name']),
        'address': str(row['Street_Address']),
        'api_key': api_key,
    }
    response = requests.get('https://apiaddress.com/', params=params)
    JSON = response.json()
    # Collect the name of every person in the response
    name_list = []
    for person in JSON['people']:
        name_list.append(person.get('name'))
    return name_list

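This function works well. For each person in the data frame, it returns a list of family members/friends. So for Kyle Smith the function returns [Heather Smith, Dan Smith], for Ted Jones it returns [Al Jones, Karen Jones, Tiffany Jones, Natalie Jones], and so on for every row/record in the data frame.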
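As a minimal sketch of the extraction step on its own, the snippet below pulls the names out of a hard-coded payload so it can be run without network access. The payload shape ({'people': [{'name': ...}, ...]}) is an assumption inferred from the loop in API_call; the real API's schema is not shown. `extract_names` and `sample_payload` are hypothetical names used only for illustration.

```python
def extract_names(payload):
    """Return the 'name' field of every entry under 'people'.

    Assumes payload looks like {'people': [{'name': ...}, ...]};
    a missing 'people' key yields an empty list.
    """
    return [person.get("name") for person in payload.get("people", [])]


# Hypothetical response body for Kyle Smith (not real API output)
sample_payload = {"people": [{"name": "Heather Smith"}, {"name": "Dan Smith"}]}

print(extract_names(sample_payload))
```

Keeping the parsing separate from the HTTP call makes it possible to test the list-building logic without hitting the API.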

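Problem: I am struggling to write a follow-up function that iterates over the returned list and appends each value to its own column, in the row of the person who was searched for. I would like the function to return a data frame that looks like this: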

First_Name | Last_Name  | Street_Address           | relative1_name  | relative2_name  | relative3_name   | relative4_name
-----------------------------------------------------------------------------------------------------------------------------
Kyle       | Smith      | 123 Main Street          | Heather Smith   | Dan Smith       |                  |
Ted        | Jones      | 456 Maple Street         | Al Jones        | Karen Jones     | Tiffany Jones    | Natalie Jones
Mary       | Johnson    | 987 Tudor Place          | Kevin Johnson   |                 |                  |
Ron        | Reagan     | 1600 Pennsylvania Avenue | Nancy Reagan    | Patti Davis     | Michael Reagan   | Christine Reagan

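Note: the goal is to vectorize everything so that I can use the apply method and eventually run the whole thing in parallel. Something along the lines of the following code has worked for me in the past, when the "API_call" function returned a single object rather than a list that needs to be iterated over/mapped: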

def API_call(row):
    # all API parameters
    url = 'https://api.com/parameters'
    response = requests.get(url)
    JSON = response.json()
    single_object = JSON['key1']['key2'].get('key3')
    return single_object

def second_function(data):
    data['single_object'] = data.apply(API_call, axis =1)
    return data

def parallelize(dataframe, function):
    df_splits = np.array_split(dataframe, 10)
    pool = Pool(4)
    df_whole = pd.concat(pool.map(function, df_splits))
    pool.close()
    pool.join()
    return df_whole

parallelize(df, second_function)

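The problem is that I can't write a vectorizable function (second_function) that maps the names in the list returned by the API to unique columns in the original data frame. Thanks in advance for your help!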

1 Answer:

Answer 0 (score: 0)

import pandas as pd


def make_relatives_frame(relatives):

    return pd.DataFrame(data=[relatives],
                        columns=["relative%i_name" % x for x in range(1, len(relatives) + 1)])

# example output from an API call
df_names = pd.DataFrame(data=[["Kyle", "Smith"]], columns=["First_Name", "Last_Name"])
relatives = ["Heather Smith", "Dan Smith"]
df_relatives = make_relatives_frame(relatives)
df_names[df_relatives.columns] = df_relatives

# example output from another API Call with more relatives
df_names2 = pd.DataFrame(data=[["John", "Smith"]], columns=["First_Name", "Last_Name"])
relatives2 = ["Heath Smith", "Daryl Smith", "Scott Smith"]
df_relatives2 = make_relatives_frame(relatives2)
df_names2[df_relatives2.columns] = df_relatives2

# example of stacking the outputs
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames the same way
total_df = pd.concat([df_names, df_names2])

print(total_df)

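The code above should help you get started. Obviously it is only a representative example, but you should be able to refactor it to fit your particular use case.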
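One possible way to adapt this idea to the question's second_function is sketched below. It is not the answerer's code: `fake_api_call` is a hypothetical stand-in for API_call that returns canned lists (so nothing is fetched over the network), and `add_relative_columns` is an illustrative name. The sketch applies the list-returning function row by row, expands the ragged lists into relativeN_name columns, and then shows that chunks with different numbers of relative columns still line up when concatenated, which is what a chunk-wise parallel run would rely on.

```python
import pandas as pd


def fake_api_call(row):
    """Hypothetical stand-in for API_call: returns a canned list of names."""
    canned = {
        "Kyle": ["Heather Smith", "Dan Smith"],
        "Ted": ["Al Jones", "Karen Jones", "Tiffany Jones", "Natalie Jones"],
        "Mary": ["Kevin Johnson"],
        "Ron": ["Nancy Reagan", "Patti Davis", "Michael Reagan", "Christine Reagan"],
    }
    return canned.get(row["First_Name"], [])


def add_relative_columns(data, fetch=fake_api_call):
    """Return a copy of data with one relativeN_name column per list position."""
    # One list of names per row (a Series whose values are lists)
    name_lists = data.apply(fetch, axis=1)
    # Ragged lists become a rectangular frame; short rows are padded with missing values
    relatives = pd.DataFrame(name_lists.tolist(), index=data.index)
    relatives.columns = ["relative%d_name" % (i + 1) for i in range(relatives.shape[1])]
    return data.join(relatives)


df = pd.DataFrame({
    "First_Name": ["Kyle", "Ted", "Mary", "Ron"],
    "Last_Name": ["Smith", "Jones", "Johnson", "Reagan"],
    "Street_Address": ["123 Main Street", "456 Maple Street",
                       "987 Tudor Place", "1600 Pennsylvania Avenue"],
})

# Whole frame at once
result = add_relative_columns(df)

# Chunk-wise, as a parallel map would do: the first chunk has only two
# relative columns, the second has four; pd.concat takes the union of columns
chunks = [df.iloc[:1], df.iloc[1:]]
combined = pd.concat([add_relative_columns(chunk) for chunk in chunks])

print(combined)
```

Because the function takes a data frame and returns a data frame, it should be usable in place of second_function in the question's parallelize helper, although that has not been tested here against the real API.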