I have a reproducible example, a toy dataframe:
import pandas as pd
df = pd.DataFrame({'my_customers':['John','Foo'],'email':['email@gmail.com','othermail@yahoo.com'],'other_column':['yes','no']})
print(df)
   my_customers                email other_column
0          John      email@gmail.com          yes
1           Foo  othermail@yahoo.com           no
Then I apply() a function to the rows, and inside the function I create a new column:
def func(row):
    # if this column is 'yes'
    if row['other_column'] == 'yes':
        # create a new column with 'Hello' in it
        row['new_column'] = 'Hello'
        # return to df
        return row
    # otherwise
    else:
        # just return the row
        return row
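Then I apply the function to the df, and we can see that the order has changed: the columns are now in alphabetical order. Is there any way to avoid this? I would like to keep the original order.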
df = df.apply(func, axis = 1)
print(df)
                 email my_customers new_column other_column
0      email@gmail.com         John      Hello          yes
1  othermail@yahoo.com          Foo        NaN           no
To clarify: the code above is too simplified.
Input:
df = pd.DataFrame({'my_customers':['John','Foo'],
                   'email':['email@gmail.com','othermail@yahoo.com'],
                   'api_status':['data found','no data found'],
                   'api_response':['huge json','huge json']})
   my_customers                email     api_status api_response
0          John      email@gmail.com     data found    huge json
1           Foo  othermail@yahoo.com  no data found    huge json
Parsing the api_response. I need to create many new columns in the DF:
def api_parse(row):
    # if we have response data ('huge json' stands in for the real payload)
    if row['api_response'] == 'huge json':
        # get response for parsing
        response_data = row['api_response']
        # default, so the social-profile filters below still work without a URL section
        urls = []
        """Let's get associated URLS first"""
        # if there's a URL section in the response
        if 'urls' in response_data.keys():
            # get all associated URLS into a list
            urls = extract_values(response_data['urls'], 'url')
            row['Associated_Urls'] = urls
        """Get a list of jobs"""
        if 'jobs' in response_data.keys():
            # get all associated jobs and organizations into a list
            titles = extract_values(response_data['jobs'], 'title')
            organizations = extract_values(response_data['jobs'], 'organization')
            counter = 1
            # create a new column for each job
            for pair in zip(titles, organizations):
                row['Job' + '_' + str(counter)] = f'Title: {pair[0]}, Organization: {pair[1]}'
                counter += 1
        """Get a list of education"""
        if 'educations' in response_data.keys():
            # get all degrees into list
            degrees = extract_values(response_data['educations'], 'display')
            counter = 1
            # create a new column for each degree
            for edu in degrees:
                row['education' + '_' + str(counter)] = edu
                counter += 1
        """Get a list of social profiles from URLS we parsed earlier"""
        facebook = [i for i in urls if 'facebook' in i] or [np.nan]
        instagram = [i for i in urls if 'instagram' in i] or [np.nan]
        linkedin = [i for i in urls if 'linkedin' in i] or [np.nan]
        twitter = [i for i in urls if 'twitter' in i] or [np.nan]
        amazon = [i for i in urls if 'amazon' in i] or [np.nan]
        row['facebook'] = facebook
        row['instagram'] = instagram
        row['linkedin'] = linkedin
        row['twitter'] = twitter
        row['amazon'] = amazon
        return row
    # column name and value spelled as in the input dataframe above
    elif row['api_status'] == 'no data found':
        # do nothing
        return row
Expected output:
   my_customers                email     api_status api_response job_1 job_2  \
0          John      email@gmail.com     data found    huge json   xyz  xyz2
1           Foo  othermail@yahoo.com  no data found    huge json   nan   nan

   education_1  facebook  other api info
0          foo  profile1             etc
1          nan       nan             nan
Answer 0 (score: 1)
You can adjust the order of the columns in your DataFrame after running the apply function. For example:
df = df.apply(func, axis = 1)
df = df[['my_customers', 'email', 'other_column', 'new_column']]
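To reduce the amount of duplication (i.e. having to retype all the column names), you can grab the existing list of columns before calling the apply function: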
columns = list(df.columns)
df = df.apply(func, axis = 1)
df = df[columns + ['new_column']]
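Update based on the author's edit to the original question. While I'm not sure the chosen data structure (storing API responses in a dataframe) is the best option, one simple solution could be to extract the new columns after calling the apply function.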
# Store the existing columns before calling apply
existing_columns = list(df.columns)
df = df.apply(func, axis = 1)
all_columns = list(df.columns)
new_columns = [column for column in all_columns if column not in existing_columns]
df = df[existing_columns + new_columns]
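As a minimal end-to-end sketch of the steps above, assuming the toy dataframe and func from the question. Having func work on a copy of the row is my own precaution and not part of the original code:

```python
import pandas as pd

df = pd.DataFrame({'my_customers': ['John', 'Foo'],
                   'email': ['email@gmail.com', 'othermail@yahoo.com'],
                   'other_column': ['yes', 'no']})


def func(row):
    # Same logic as the question's func, but applied to a copy
    # (assumption: avoids mutating the object pandas passes in).
    row = row.copy()
    if row['other_column'] == 'yes':
        row['new_column'] = 'Hello'
    return row


# Remember the column order before apply() has a chance to change it
existing_columns = list(df.columns)
result = df.apply(func, axis=1)

# Whatever order apply() produced, pick out the labels that are new ...
new_columns = [c for c in result.columns if c not in existing_columns]
# ... and put the original columns first, in their original order
result = result[existing_columns + new_columns]
print(list(result.columns))
```

The final selection does not depend on how apply() happened to order the columns, so it should give the same layout whether or not your pandas version sorts them.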
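To optimize performance, you can store the existing columns in a set instead of a list: because Python's set is hash-based, membership lookups take constant time on average. This would change existing_columns = list(df.columns) to existing_columns = set(df.columns). Note that a set does not preserve column order and cannot be concatenated with a list, so keep an ordered list of the original columns for the final selection and use the set only for the "not in" check.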
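A small sketch of that idea in plain Python, with no pandas required. The helper name order_columns is hypothetical; it uses a set only for the membership test and a list for the ordering:

```python
def order_columns(existing, current):
    """Return `existing` followed by the labels in `current` that are new."""
    seen = set(existing)  # hashed copy, used only for fast membership tests
    new = [c for c in current if c not in seen]
    return list(existing) + new  # the list keeps the original order


existing_columns = ['my_customers', 'email', 'other_column']
# Column order as printed in the question after apply()
current_columns = ['email', 'my_customers', 'new_column', 'other_column']

ordered = order_columns(existing_columns, current_columns)
print(ordered)
# ['my_customers', 'email', 'other_column', 'new_column']
```

With a real dataframe you would pass list(df.columns) captured before apply() as existing and the post-apply df.columns as current. For a handful of columns the speed difference is unlikely to be noticeable; it only starts to matter when there are many columns.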
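Finally, as @Parfait very kindly pointed out in a comment, the code above may raise some deprecation warnings. Using pandas.DataFrame.reindex instead of selecting the columns with df[...] will make the warnings go away: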
# reindex expects a flat list of labels, so do not wrap the sum in another list
new_columns_order = existing_columns + new_columns
df = df.reindex(columns=new_columns_order)
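A short sketch of the reindex variant, assuming a dataframe that already has the alphabetical column order shown in the question. The label Job_1 is included only to illustrate how reindex treats a label that is not present:

```python
import pandas as pd

# Columns in the alphabetical order that apply() produced in the question
df = pd.DataFrame({'email': ['email@gmail.com', 'othermail@yahoo.com'],
                   'my_customers': ['John', 'Foo'],
                   'new_column': ['Hello', None],
                   'other_column': ['yes', 'no']})

# A flat list of labels in the order we want
wanted = ['my_customers', 'email', 'other_column', 'new_column', 'Job_1']
reordered = df.reindex(columns=wanted)

# 'Job_1' does not exist in df, so reindex adds it filled with missing values
print(reordered.columns.tolist())
print(reordered['Job_1'].isna().all())
```

By contrast, selecting with df[wanted] raises a KeyError for the missing label in recent pandas versions, so reindex is the more forgiving choice when some rows may not produce every column.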
Answer 1 (score: 1)
This happens because you do not assign a value to the new column when row["other_column"] != 'yes'. Just try this:
def func(row):
    if row['other_column'] == 'yes':
        row['new_column'] = 'Hello'
        return row
    else:
        row['new_column'] = ''
        return row
df.apply(func, axis = 1)
You can choose whatever value you like for row["new_column"] in the rows where other_column is 'no'. I just left it blank.
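To illustrate the idea, here is a sketch in which every row returns the same labels in the same order. The np.nan fill value and the row.copy() call are my assumptions; the answer above uses an empty string and modifies the row in place:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'my_customers': ['John', 'Foo'],
                   'email': ['email@gmail.com', 'othermail@yahoo.com'],
                   'other_column': ['yes', 'no']})


def func(row):
    # Work on a copy so the object handed in by pandas is left untouched
    row = row.copy()
    # Always assign new_column, so every returned row has identical labels
    row['new_column'] = 'Hello' if row['other_column'] == 'yes' else np.nan
    return row


result = df.apply(func, axis=1)
print(list(result.columns))
```

Because both rows now carry identical labels in the same order, pandas has nothing to merge when it reassembles the frame, which appears to be why the original order survives.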