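I have a merged df with two experiment ID columns, experiment_a and experiment_b.
The IDs follow the general naming convention EXPT_YEAR_NUM, although some were added that have other values in place of the year. In this df, wherever experiment_a has a value, experiment_b is NaN, and vice versa.
For example: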
experiment_a experiment_b
EXPT_2011_06 NaN
NaN EXPT_2011_07
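How can I sort by the ascending values of experiment_a and experiment_b taken together, rather than ascending on experiment_a (where experiment_b is all NaN) and then ascending on experiment_b for the rows where experiment_a is NaN?
This is what happens when I use sort_values: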
df = df.sort_values(['experiment_a', 'experiment_b'])
It obviously just sorts by experiment_a first, and then by experiment_b.
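The behaviour described above can be reproduced with a small frame. This is a minimal sketch on hypothetical sample data (the four IDs are made up to follow the EXPT_YEAR_NUM pattern); it only illustrates why the two-column sort keeps the rows in two separate blocks.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: each row has a value in exactly one column.
df = pd.DataFrame({
    "experiment_a": ["EXPT_2011_06", "EXPT_2010_06", np.nan, np.nan],
    "experiment_b": [np.nan, np.nan, "EXPT_2011_07", "EXPT_2010_05"],
})

# Sorting on both columns orders the rows that have an experiment_a value
# first (NaN goes last by default), and only then uses experiment_b to
# order the rows where experiment_a is NaN.
by_columns = df.sort_values(["experiment_a", "experiment_b"])
print(by_columns)
```

EXPT_2010_05 is the smallest ID overall, yet it ends up in the second block because its experiment_a is NaN.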
Answer 0 (score: 1)
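I think you need fillna to fill the missing values of experiment_a from experiment_b, then argsort to get the positions that sort the resulting Series, and finally iloc to select the rows in that order:
Details: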
print (df)
experiment_a experiment_b
0 EXPT_2011_06 NaN
1 EXPT_2010_06 NaN
2 NaN EXPT_2011_07
df = df.iloc[df['experiment_a'].fillna(df['experiment_b']).argsort()]
print (df)
experiment_a experiment_b
1 EXPT_2010_06 NaN
0 EXPT_2011_06 NaN
2 NaN EXPT_2011_07
The intermediate results, computed on the original (unsorted) df:
print (df['experiment_a'].fillna(df['experiment_b']))
0 EXPT_2011_06
1 EXPT_2010_06
2 EXPT_2011_07
Name: experiment_a, dtype: object
print (df['experiment_a'].fillna(df['experiment_b']).argsort())
0 1
1 0
2 2
Name: experiment_a, dtype: int64
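For reference, here is a self-contained sketch of the same fillna / argsort / iloc steps. The four-row frame is hypothetical sample data, chosen so that an experiment_b value has to be interleaved among the experiment_a values. It assumes every row has a value in at least one of the two columns; rows where both are NaN are not handled here.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: each row has a value in exactly one column.
df = pd.DataFrame({
    "experiment_a": ["EXPT_2011_06", "EXPT_2010_06", np.nan, np.nan],
    "experiment_b": [np.nan, np.nan, "EXPT_2011_07", "EXPT_2010_05"],
})

# Combine the two columns into one sort key: take experiment_a and
# fall back to experiment_b where experiment_a is NaN.
key = df["experiment_a"].fillna(df["experiment_b"])

# argsort returns the integer positions that would sort the key.
order = key.argsort()

# iloc selects rows by position, so this reorders the whole frame.
sorted_df = df.iloc[order.to_numpy()]
print(sorted_df)
```

The key is only used to compute the order, so both original columns keep their NaN values in the result.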
I also tested other solutions; np.where performs slightly better, but it depends mainly on the data.
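The np.where variant that was tested is not shown here, so the following is only a guess at what it might look like: the sort key is built with np.where on plain NumPy arrays instead of fillna. The sample frame and the variable names are assumptions, and no timing is claimed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: each row has a value in exactly one column.
df = pd.DataFrame({
    "experiment_a": ["EXPT_2011_06", "EXPT_2010_06", np.nan, np.nan],
    "experiment_b": [np.nan, np.nan, "EXPT_2011_07", "EXPT_2010_05"],
})

# Work on object arrays so the strings and NaN survive unchanged.
a = df["experiment_a"].to_numpy(dtype=object)
b = df["experiment_b"].to_numpy(dtype=object)

# Where experiment_a is missing take experiment_b, otherwise experiment_a.
key = np.where(pd.isna(a), b, a)

# Positions that sort the combined key; "stable" keeps ties in input order.
order = np.argsort(key, kind="stable")
sorted_df = df.iloc[order]
print(sorted_df)
```

Whether this is actually faster than fillna would have to be measured on the real data.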
Answer 1 (score: 0)
First build a key column:
key = df.experiment_a.where(df.experiment_a.notnull(), df.experiment_b)
Then the index:
idx = key.argsort()
Finally, use idx to reorder the rows, presumably with df.iloc[idx].
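The key and index steps of this answer could be wrapped into a small function. This is a sketch only: sort_by_either is a hypothetical helper name, not a pandas function, and the final iloc selection is an assumption about how idx is meant to be used. The sample frame is the three-row example from Answer 0.

```python
import numpy as np
import pandas as pd


def sort_by_either(frame, first, second):
    """Hypothetical helper: sort rows by `first`, using `second` where `first` is missing."""
    # Key: keep `first` where it is present, otherwise take `second`.
    key = frame[first].where(frame[first].notnull(), frame[second])
    # Index: integer positions that would sort the key.
    idx = key.argsort()
    # Selection: return the rows in that order; the input frame is not modified.
    return frame.iloc[idx.to_numpy()]


df = pd.DataFrame({
    "experiment_a": ["EXPT_2011_06", "EXPT_2010_06", np.nan],
    "experiment_b": [np.nan, np.nan, "EXPT_2011_07"],
})

result = sort_by_either(df, "experiment_a", "experiment_b")
print(result)
```

Returning a new frame instead of reassigning df keeps the original row order available for comparison.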