How do I stop apply() from changing the column order?

Asked: 2019-08-26 17:06:07

Tags: python pandas

I have a reproducible example with a toy DataFrame:

import pandas as pd

df = pd.DataFrame({'my_customers': ['John', 'Foo'],
                   'email': ['email@gmail.com', 'othermail@yahoo.com'],
                   'other_column': ['yes', 'no']})

print(df)

  my_customers                email other_column
0         John      email@gmail.com          yes
1          Foo  othermail@yahoo.com           no

Then I apply() a function to the rows, which creates a new column inside the function:

def func(row):

    # if this column is 'yes'
    if row['other_column'] == 'yes':
        # create a new column with 'Hello' in it
        row['new_column'] = 'Hello'
        # return the row to the df
        return row

    # otherwise
    else:
        # just return the row unchanged
        return row

Then I apply the function to the df, and we can see the order has changed: the columns are now in alphabetical order. Is there any way to avoid this? I'd like to keep the original order.

df = df.apply(func, axis = 1)
print(df)

                 email my_customers new_column other_column
0      email@gmail.com         John      Hello          yes
1  othermail@yahoo.com          Foo        NaN           no

To clarify - the code above is an over-simplification.

Input:

df = pd.DataFrame({'my_customers':['John','Foo'],
                   'email':['email@gmail.com','othermail@yahoo.com'],
                   'api_status':['data found','no data found'],
                   'api_response':['huge json','huge json']})

  my_customers                email     api_status api_response
0         John      email@gmail.com     data found    huge json
1          Foo  othermail@yahoo.com  no data found    huge json

The api_response gets parsed, and I need to create many new columns in the DF:

# extract_values() is a helper defined elsewhere; np is numpy (import numpy as np)

def api_parse(row):

    # if we have response data for this row
    if row['api_status'] == 'data found':

        # get the response for parsing
        response_data = row['api_response']
        urls = []

        """Let's get associated URLS first"""

        # if there's a URL section in the response
        if 'urls' in response_data:
            # get all associated URLS into a list
            urls = extract_values(response_data['urls'], 'url')
            row['Associated_Urls'] = urls

        """Get a list of jobs"""

        if 'jobs' in response_data:
            # get all associated jobs and organizations into lists
            titles = extract_values(response_data['jobs'], 'title')
            organizations = extract_values(response_data['jobs'], 'organization')

            # create a new column for each job
            counter = 1
            for title, organization in zip(titles, organizations):
                row['Job' + '_' + str(counter)] = f'Title: {title}, Organization: {organization}'
                counter += 1

        """Get a list of education"""

        if 'educations' in response_data:
            # get all degrees into a list
            degrees = extract_values(response_data['educations'], 'display')

            # create a new column for each degree
            counter = 1
            for edu in degrees:
                row['education' + '_' + str(counter)] = edu
                counter += 1

        """Get a list of social profiles from the URLS we parsed earlier"""

        facebook = [i for i in urls if 'facebook' in i] or [np.nan]
        instagram = [i for i in urls if 'instagram' in i] or [np.nan]
        linkedin = [i for i in urls if 'linkedin' in i] or [np.nan]
        twitter = [i for i in urls if 'twitter' in i] or [np.nan]
        amazon = [i for i in urls if 'amazon' in i] or [np.nan]

        row['facebook'] = facebook
        row['instagram'] = instagram
        row['linkedin'] = linkedin
        row['twitter'] = twitter
        row['amazon'] = amazon

        return row

    elif row['api_status'] == 'no data found':
        # do nothing
        return row

Expected output:

  my_customers                email     api_status api_response job_1 job_2  \
0         John      email@gmail.com     data found    huge json   xyz  xyz2   
1          Foo  othermail@yahoo.com  no data found    huge json   nan  nan

  education_1  facebook other api info  
0         foo  profile1            etc  
1         nan  nan                 nan

2 Answers:

Answer 0 (Score: 1)

You can re-order the columns of the DataFrame after running the apply function. For example:

df = df.apply(func, axis = 1)
df = df[['my_customers', 'email', 'other_column', 'new_column']]

To cut down on the repetition (i.e., having to re-type all the column names), you can grab the existing columns before calling the apply function:

columns = list(df.columns)
df = df.apply(func, axis = 1)
df = df[columns + ['new_column']]
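With the toy DataFrame from the question, the result would then look roughly like this (a sketch, not verified output):

  my_customers                email other_column new_column
0         John      email@gmail.com          yes      Hello
1          Foo  othermail@yahoo.com           no        NaN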

Update, based on the author's edits to the original question. While I'm not sure the chosen data structure (storing an API response in a DataFrame) is the best choice, a simple solution could be to extract the new columns after calling the apply function:

# Store the existing columns before calling apply
existing_columns = list(df.columns)

df = df.apply(func, axis = 1)

all_columns = list(df.columns)
new_columns = [column for column in all_columns if column not in existing_columns]

df = df[existing_columns + new_columns]

To optimize performance, you can store the existing columns in a set rather than a list: thanks to the hashed nature of Python's set data structure, the membership lookups then run in constant time. That means changing the lookup to test against set(df.columns), while keeping the ordered list of existing columns around for the final re-ordering step, since sets don't preserve order.
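A minimal sketch of that variant (existing_columns_set is a name introduced here just for the membership test):

existing_columns = list(df.columns)           # ordered list, used for re-ordering later
existing_columns_set = set(existing_columns)  # set, used for constant-time membership checks

df = df.apply(func, axis = 1)

new_columns = [column for column in df.columns if column not in existing_columns_set]
df = df[existing_columns + new_columns]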


Finally, as @Parfait very kindly pointed out in the comments, the code above may raise deprecation warnings. Using pandas.DataFrame.reindex instead of df = df[existing_columns + new_columns] makes the warnings go away:

new_columns_order = existing_columns + new_columns
df = df.reindex(columns=new_columns_order)

Answer 1 (Score: 1)

This happens because you don't assign a value to the new column when row['other_column'] != 'yes'. Just try:

def func(row):

    if row['other_column'] == 'yes':
        row['new_column'] = 'Hello'
        return row

    else:
        row['new_column'] = ''
        return row

df.apply(func, axis = 1)

You can choose whatever value you want for row['new_column'] when other_column is 'no'; I just left it as an empty string.
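For example, a minimal variation of the same idea (assuming numpy is imported as np) that fills the 'no' rows with NaN instead of an empty string:

import numpy as np

def func(row):
    if row['other_column'] == 'yes':
        row['new_column'] = 'Hello'
    else:
        # any placeholder works here; NaN keeps "missing" semantics
        row['new_column'] = np.nan
    return row

df = df.apply(func, axis = 1)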