Conditionally changing a pandas DataFrame index

Time: 2018-08-16 14:24:46

Tags: python pandas

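I have a MultiIndex pandas DataFrame like the one below. This is only a small example of the problem I am facing. In practice the DataFrame can be very large and contain many occurrences of this problem.

A value that appears in index2 of one row (for example 766206 in the first row) also appears as index1 of a later row. That should not happen. I need to change index1 of the later row to the index1 of the row where the value first appeared (664627 here), so that all of these rows belong to the same index1 group.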

                 given_name
 index1   index2    
 664627    766206         1
          1297240         1
          1429530         1
 569874    396418         1
 766206   1429531         1
 169874   3697813         1
 123456   1598742         1
 1598742  19543864        1

The desired output should look like this:

                 given_name
 index1   index2    
 664627    766206         1
          1297240         1
          1429530         1
          1429531         1
 569874    396418         1
 169874   3697813         1
 123456   1598742         1
         19543864         1

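Ideally the solution should be vectorized and fast. I do not have to work on the index itself: the DataFrame can be flattened with reset_index(), processed as ordinary columns, and those columns can then be set as the index again.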
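As a minimal sketch of that round trip (the three rows are taken from the example above; the variable names are only illustrative), the index levels can be turned into columns and back like this:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(664627, 766206), (664627, 1297240), (766206, 1429531)],
    names=["index1", "index2"],
)
df = pd.DataFrame({"given_name": [1, 1, 1]}, index=index)

# index1 and index2 become ordinary columns
flat = df.reset_index()

# ... column-wise (vectorized) edits to flat["index1"] would go here ...

# turn the two columns back into a MultiIndex
restored = flat.set_index(["index1", "index2"])
```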

2 answers:

Answer 0: (score: 0)

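I think you need get_level_values for the first level of the MultiIndex, converted to a Series with to_series, then mask with a condition created by isin to replace the matching values with NaN, followed by a forward fill.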
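A minimal sketch of that idea on made-up data follows. The answer's own code is not available here, so the exact sequence of calls is an assumption. Note that a forward fill takes the nearest preceding index1 that was kept, so this only produces the intended grouping when each offending row directly follows the group it belongs to, and it assumes the very first row does not need fixing.

```python
import pandas as pd

# Made-up example: index1 values 2 and 6 also occur in index2
index = pd.MultiIndex.from_tuples(
    [(1, 2), (1, 3), (2, 4), (5, 6), (6, 7)],
    names=["index1", "index2"],
)
df = pd.DataFrame({"given_name": [1, 1, 1, 1, 1]}, index=index)

# First level as a Series with a plain positional index
idx1 = df.index.get_level_values("index1").to_series().reset_index(drop=True)
idx2 = df.index.get_level_values("index2")

# Blank out index1 values that also occur in index2, then forward fill
fixed = idx1.mask(idx1.isin(idx2)).ffill().astype("int64")

out = df.copy()
out.index = pd.MultiIndex.from_arrays(
    [fixed.to_numpy(), idx2], names=["index1", "index2"]
)
```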

Answer 1: (score: 0)

I found a solution:

In [1]: from io import StringIO
   ...: 
   ...: import pandas as pd
   ...: import numpy as np
   ...: 
   ...: string1 = """index1,index2,given_name
   ...: 1,2,1
   ...: 1,3,1
   ...: 1,4,1
   ...: 2,5,1
   ...: 6,7,1
   ...: 6,8,1
   ...: 7,9,1
   ...: 9,10,1
   ...: 10,11,1
   ...: 5,12,1
   ...: 12,13,1
   ...: 13,14,1"""
   ...: 
   ...: # pd.compat.StringIO is no longer available in current pandas; use io.StringIO
   ...: df = pd.read_csv(StringIO(string1), index_col=[0,1])
   ...: 

In [2]: df
Out[2]: 
               given_name
index1 index2            
1      2                1
       3                1
       4                1
2      5                1
6      7                1
       8                1
7      9                1
9      10               1
10     11               1
5      12               1
12     13               1
13     14               1


def find_fixing_rows(df):
    """Return the rows whose index2 value also appears as an index1 value."""

    df = df.reset_index()

    # unique values of the first and second index levels
    level_zero_indexs = np.unique(df.index1.values)
    level_one_indexs = np.unique(df.index2.values)

    # values that appear in both levels; these are the ones that need fixing
    intersect_index = np.intersect1d(level_zero_indexs, level_one_indexs)

    # rows that need to be fixed, selected via intersect_index
    df_need_fix = df[df.index2.isin(intersect_index)]

    return df_need_fix


def combine_missed_matches(df):

    # rows whose index2 value also appears as an index1 value
    df_need_fix = find_fixing_rows(df)

    df = df.reset_index()

    # left-join the rows to fix onto the original dataframe so that each row
    # whose index1 matches some index2 sees the index1 of that matching row
    df_with_need_fix_join = pd.merge(df,
                                     df_need_fix,
                                     left_on='index1',
                                     right_on='index2',
                                     how='left')

    # where a match was found, take index1 from the matched row;
    # otherwise keep the original index1
    df_with_need_fix_join['index1_x'] = np.where(df_with_need_fix_join.index1_y.notnull(),
                                                 df_with_need_fix_join.index1_y,
                                                 df_with_need_fix_join.index1_x)

    # dropping columns, renaming and tidying
    df_with_need_fix_join = df_with_need_fix_join.drop(['index1_y',
                                                        'index2_y',
                                                        'given_name_y'],
                                                       axis=1)

    df_with_need_fix_join = df_with_need_fix_join.rename(columns={
                                'index1_x' : 'index1',
                                'index2_x' : 'index2',
                                'given_name_x' : 'given_name'
                            })

    # np.int was removed from NumPy; the builtin int is equivalent here
    df_with_need_fix_join.index1 = df_with_need_fix_join.index1.astype(int)

    df_with_need_fix_join = df_with_need_fix_join.set_index(['index1', 'index2'])

    return  df_with_need_fix_join

def fix_missing_matches(df, condition=True):

    while condition:

        condition = find_fixing_rows(df).shape[0] > 0
        df = combine_missed_matches(df)

    df = df.sort_index()

    return df

In [4]: df_fix = fix_missing_matches(df)
   ...: 
   ...: df_fix
   ...: 
Out[4]: 
               given_name
index1 index2            
1      2                1
       3                1
       4                1
       5                1
       12               1
       13               1
       14               1
6      7                1
       8                1
       9                1
       10               1
       11               1
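The merge in combine_missed_matches effectively looks up each index1 value in an index2 -> index1 mapping and repeats until nothing changes. A more compact sketch of the same idea is shown below. The helper name relabel_chained_index is hypothetical, and the sketch assumes that every index2 value is unique, that the chains contain no cycles, and that the index values are small enough to survive a temporary conversion to float.

```python
from io import StringIO

import pandas as pd

csv_text = """index1,index2,given_name
1,2,1
1,3,1
1,4,1
2,5,1
6,7,1
6,8,1
7,9,1
9,10,1
10,11,1
5,12,1
12,13,1
13,14,1"""

df = pd.read_csv(StringIO(csv_text), index_col=[0, 1])


def relabel_chained_index(df):
    """Hypothetical helper: replace index1 by the index1 of the row where it
    appears as index2, repeating until no index1 value occurs in index2."""
    flat = df.reset_index()
    while True:
        # lookup table: index2 value -> index1 of the row holding it
        parent = flat.set_index("index2")["index1"]
        mapped = flat["index1"].map(parent)
        if mapped.isna().all():
            # no index1 value appears in index2 any more
            break
        # rows without a match keep their current index1
        flat["index1"] = mapped.fillna(flat["index1"]).astype("int64")
    return flat.set_index(["index1", "index2"]).sort_index()


result = relabel_chained_index(df)
```

On this sample data the result has the same grouping as df_fix above: index1 1 holds index2 2, 3, 4, 5, 12, 13, 14, and index1 6 holds 7, 8, 9, 10, 11.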