Sort a pandas DataFrame on two similar columns, where one is NaN whenever the other has a value

Time: 2018-02-09 11:33:27

Tags: python pandas

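I have a merged df that has 2 experiment IDs: experiment_a and experiment_b.

They follow the general nomenclature EXPT_YEAR_NUM, although some were added without a year and carry other values instead. In this df, wherever experiment_a has a value, experiment_b is NaN, and vice versa.

For example: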

experiment_a    experiment_b
EXPT_2011_06     NaN
NaN              EXPT_2011_07

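How can I sort on the values of experiment_a and _b together in ascending order, rather than getting experiment_a ascending (where _b is all NaN) followed by experiment_b ascending (where experiment_a is NaN)?

This is what happens when I use sort_values: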

df = df.sort_values(['experiment_a', 'experiment_b'])

It obviously just sorts by _a, and then by _b.
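To make the problem concrete, here is a minimal sketch that reproduces it. The four sample rows are hypothetical (they are not from the question) and only mimic the shape described above; they assume every row has a value in exactly one of the two columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data shaped like the frame in the question.
df = pd.DataFrame({
    "experiment_a": ["EXPT_2011_06", np.nan, "EXPT_2010_06", np.nan],
    "experiment_b": [np.nan, "EXPT_2011_07", np.nan, "EXPT_2009_01"],
})

# Sorting on both columns orders by experiment_a first. NaN keys go last by
# default (na_position="last"), so the rows that only have experiment_b are
# pushed to the bottom and sorted among themselves.
by_columns = df.sort_values(["experiment_a", "experiment_b"])
print(by_columns)
```

The row holding EXPT_2009_01 ends up third instead of first, which is the behaviour the question wants to avoid.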

2 answers:

Answer 0: (score: 1)

I think you need fillna first, then argsort to get the positions of the sorted values, and finally iloc to select the rows in that order:

print (df)
   experiment_a  experiment_b
0  EXPT_2011_06           NaN
1  EXPT_2010_06           NaN
2           NaN  EXPT_2011_07

df = df.iloc[df['experiment_a'].fillna(df['experiment_b']).argsort()]
print (df)
   experiment_a  experiment_b
1  EXPT_2010_06           NaN
0  EXPT_2011_06           NaN
2           NaN  EXPT_2011_07

Detail:

print (df['experiment_a'].fillna(df['experiment_b']))
0    EXPT_2011_06
1    EXPT_2010_06
2    EXPT_2011_07
Name: experiment_a, dtype: object

print (df['experiment_a'].fillna(df['experiment_b']).argsort())
0    1
1    0
2    2
Name: experiment_a, dtype: int64

I also tested a few more solutions; np.where performs slightly better, but it mainly depends on the data.
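The answer names np.where but its code for that variant is not preserved here, so the following is only a sketch of what such a variant might look like, not the answerer's exact code. It assumes every row has a value in at least one of the two columns; a row with NaN in both would leave a float in the key and make the string comparison fail.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "experiment_a": ["EXPT_2011_06", "EXPT_2010_06", np.nan],
    "experiment_b": [np.nan, np.nan, "EXPT_2011_07"],
})

# Take experiment_b wherever experiment_a is missing, otherwise experiment_a.
key = np.where(df["experiment_a"].isna(), df["experiment_b"], df["experiment_a"])

# key is a plain NumPy object array; argsort returns the positions that
# would put it in ascending order.
order = np.argsort(key)
result = df.iloc[order]
print(result)
```

Whether this is actually faster than fillna would need to be timed on the real data.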

Answer 1: (score: 0)

First build a key column:

key = df.experiment_a.where(df.experiment_a.notnull(), df.experiment_b)

Then the index:

idx = key.argsort()

Finally, reorder the rows with it, for example via iloc as in the answer above:

df = df.iloc[idx]
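The steps in this answer (build a key, take its sort order, reorder the rows) can be wrapped into one helper. This is a minimal sketch: the function name sort_by_either and the sample rows are hypothetical, and sorting the key with sort_values instead of argsort is a swapped-in choice so that rows with NaN in both columns land at the end rather than producing a -1 position.

```python
import numpy as np
import pandas as pd

def sort_by_either(df, first="experiment_a", second="experiment_b"):
    """Sort rows by whichever of the two columns holds a value (hypothetical helper)."""
    # Use `first` where it is present, otherwise fall back to `second`.
    key = df[first].where(df[first].notnull(), df[second])
    # Drop the original labels so the sorted index holds row positions;
    # mergesort is stable, and NaN keys are placed last by default.
    positions = key.reset_index(drop=True).sort_values(kind="mergesort").index
    return df.iloc[positions.to_numpy()]

# Hypothetical sample data shaped like the frame in the question.
df = pd.DataFrame({
    "experiment_a": ["EXPT_2011_06", np.nan, "EXPT_2010_06", np.nan],
    "experiment_b": [np.nan, "EXPT_2011_07", np.nan, "EXPT_2009_01"],
})
result = sort_by_either(df)
print(result)
```

Working with positions rather than labels keeps the helper usable on frames whose index contains duplicates.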