Using += with a DataFrame that does not exist yet

Time: 2016-08-11 21:17:27

Tags: python pandas

df_pairs:

city1   city2
0   sfo yyz
1   sfo yvr
2   sfo dfw
3   sfo ewr

Output of df_pairs.to_dict('records'):

[{'city1': 'sfo', 'city2': 'yyz'},
 {'city1': 'sfo', 'city2': 'yvr'},
 {'city1': 'sfo', 'city2': 'dfw'},
 {'city1': 'sfo', 'city2': 'ewr'}]

data_df:

    city    2016-02-02 00:00:00 2016-02-05 00:00:00 2016-02-01 00:00:00 2016-02-04 00:00:00 2016-02-03 00:00:00
0   sfo -33.63  -62.34  -35.70  -31.84  -33.87
1   yyz -24.31  -51.17  -22.07  -31.00  -23.00
2   yvr -24.31  -51.17  -22.07  -31.00  -23.00
3   dfw -32.17  -43.77  -34.84  0.27    -11.49
4   ewr -28.87  -59.66  -28.40  -32.94  -29.06

Output of data_df.to_dict('records'):

[{'city': 'sfo',
  Timestamp('2016-02-02 00:00:00'): -33.63,
  Timestamp('2016-02-05 00:00:00'): -62.34,
  Timestamp('2016-02-01 00:00:00'): -35.7,
  Timestamp('2016-02-04 00:00:00'): -31.84,
  Timestamp('2016-02-03 00:00:00'): -33.87},
 {'city': 'yyz',
  Timestamp('2016-02-02 00:00:00'): -24.31,
  Timestamp('2016-02-05 00:00:00'): -51.17,
  Timestamp('2016-02-01 00:00:00'): -22.07,
  Timestamp('2016-02-04 00:00:00'): -31.0,
  Timestamp('2016-02-03 00:00:00'): -23.0},
 {'city': 'yvr',
  Timestamp('2016-02-02 00:00:00'): -24.31,
  Timestamp('2016-02-05 00:00:00'): -51.17,
  Timestamp('2016-02-01 00:00:00'): -22.07,
  Timestamp('2016-02-04 00:00:00'): -31.0,
  Timestamp('2016-02-03 00:00:00'): -23.0},
 {'city': 'dfw',
  Timestamp('2016-02-02 00:00:00'): -32.17,
  Timestamp('2016-02-05 00:00:00'): -43.77,
  Timestamp('2016-02-01 00:00:00'): -34.84,
  Timestamp('2016-02-04 00:00:00'): 0.27,
  Timestamp('2016-02-03 00:00:00'): -11.49},
 {'city': 'ewr',
  Timestamp('2016-02-02 00:00:00'): -28.87,
  Timestamp('2016-02-05 00:00:00'): -59.66,
  Timestamp('2016-02-01 00:00:00'): -28.4,
  Timestamp('2016-02-04 00:00:00'): -32.94,
  Timestamp('2016-02-03 00:00:00'): -29.06}]

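So I have a df called df_pairs. For each pair in df_pairs, I want to look up city1 and city2 in data_df, subtract one from the other, and take the sign of the resulting difference time series. I then want to separate the positive and negative signs, and the positive and negative difference values, and sum each of them across the data_df columns.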

diff_df_sign_pos = diff_df_sign_neg = diff_df_pos = diff_df_neg = 0

for i in range(0,len(data_df.columns)):
    a = pd.merge(df_pairs[['city1','city2']], data_df.ix[:, [i]], left_on='city1', right_index=True, how='left').set_index(['city1', 'city2'])
    b = pd.merge(df_pairs[['city1','city2']], data_df.ix[:, [i]], left_on='city2', right_index=True, how='left').set_index(['city1', 'city2'])
    diff_df = b - a
    diff_df_sign = np.sign(diff_df)
    diff_df_sign_pos+= diff_df_sign.clip(lower=0)
    diff_df_sign_neg+= diff_df_sign.clip(upper=0)
    diff_df_pos+= diff_df.clip(lower=0)
    diff_df_neg+= diff_df.clip(upper=0)

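If you run the code above, you will find that the final values of diff_df_sign_pos, diff_df_sign_neg, diff_df_pos and diff_df_neg are all NaN.

For example, the final result for diff_df_sign_pos should look like this: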

               2016-02-03 00:00:00
city1    city2  
sfo      yyz    5.0
         yvr    5.0
         dfw    5.0
         ewr    4.0

This tells us that all 5 differences between sfo and each of yyz, yvr and dfw are positive.
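One likely source of the NaN values is label alignment: each pass through the loop yields a one-column frame named after a different date, and adding two DataFrames whose column labels do not overlap gives NaN everywhere. The lookups with right_index=True may also fail to match, since data_df is indexed by integers rather than city names. The sketch below reproduces only the alignment part, using made-up one-column frames (`day1`, `day2` and their values are illustrative, not taken from the question):

```python
import pandas as pd

pairs = pd.Index(["yyz", "ewr"], name="city2")

# Two one-column frames with different column labels,
# similar to what each loop iteration produces.
day1 = pd.DataFrame({"2016-02-01": [1.0, 1.0]}, index=pairs)
day2 = pd.DataFrame({"2016-02-02": [1.0, 0.0]}, index=pairs)

total = 0
total += day1  # 0 + DataFrame works and returns a new frame
total += day2  # the column labels differ, so every cell becomes NaN
all_nan = bool(total.isna().all().all())

# Accumulating Series instead aligns on the row labels only,
# so the per-date values add up as intended.
count = 0
for frame in (day1, day2):
    count += frame.iloc[:, 0]

print(all_nan)
print(count)
```

Accumulating Series (or raw arrays) is one way around the problem inside a loop; summing across columns with axis=1 avoids the loop altogether.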

2 answers:

Answer 0: (score: 1)

Why not do it this way:

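# Drop the city-name columns after merging so the subtraction only sees the numeric date columns
df_city1 = pd.merge(df_pairs[['city1']], data_df, left_on='city1', right_on='city', how='left').drop(['city1', 'city'], axis=1)
df_city2 = pd.merge(df_pairs[['city2']], data_df, left_on='city2', right_on='city', how='left').drop(['city2', 'city'], axis=1)
diff = df_city2.subtract(df_city1, fill_value=0)  # city2 minus city1, row by row
pos_sum = diff[diff >= 0].sum(axis=1)  # sum of the non-negative differences for each pair
neg_sum = diff[diff <  0].sum(axis=1)  # sum of the negative differences for each pair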

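That is, instead of looping over every column, merging 2 * (number of columns) times, not to mention the indexing, and then that convoluted np.sign and .clip bit... Your {{1} and df_pairs have a one-to-one correspondence, right?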
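As a rough check that boolean masking and clipping lead to the same totals, here is a small sketch on a hand-made `diff` frame. The column names `day1` and `day2` are hypothetical and the values are only illustrative. Masked-out cells become NaN and are skipped by sum, while clipped cells become 0, so both routes add up the same numbers. Counting the positive differences (rather than summing them) can be done by summing a boolean frame.

```python
import pandas as pd

# Illustrative differences for two city pairs over two dates.
diff = pd.DataFrame(
    {"day1": [9.32, 4.76], "day2": [0.84, -1.10]},
    index=["yyz", "ewr"],
)

# Boolean masking: cells failing the condition become NaN and are skipped by sum().
pos_by_mask = diff[diff >= 0].sum(axis=1)
neg_by_mask = diff[diff < 0].sum(axis=1)

# Clipping: cells on the wrong side of zero become 0 and add nothing.
pos_by_clip = diff.clip(lower=0).sum(axis=1)
neg_by_clip = diff.clip(upper=0).sum(axis=1)

# Number of strictly positive differences per pair.
pos_count = (diff > 0).sum(axis=1)

print(pos_by_mask, neg_by_mask, pos_count, sep="\n")
```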

Answer 1: (score: 0)

Give this a run:

Take out the initial variables and get rid of the for loop.

a = pd.merge(df_pairs, data_df, left_on='city1', right_on='city', how='left').set_index(['city1', 'city2'])
b = pd.merge(df_pairs, data_df, left_on='city2', right_on='city', how='left').set_index(['city1', 'city2'])
del a['city']
del b['city']

Now do the calculation once and sum across each row (axis=1):

diff_df = b - a
diff_df_sign = np.sign(diff_df)
diff_df_sign_pos = diff_df_sign.clip(lower=0).sum(axis=1)
diff_df_sign_neg = diff_df_sign.clip(upper=0).sum(axis=1)
diff_df_pos = diff_df.clip(lower=0).sum(axis=1)
diff_df_neg = diff_df.clip(upper=0).sum(axis=1)

Does this look like the output you want?

city1  city2
sfo    yyz      5
       yvr      5
       dfw      5
       ewr      4
dtype: float64

city1  city2
sfo    yyz      0
       yvr      0
       dfw      0
       ewr     -1
dtype: float64

city1  city2
sfo    yyz      45.83
       yvr      45.83
       dfw      75.38
       ewr      19.55
dtype: float64

city1  city2
sfo    yyz      0.0
       yvr      0.0
       dfw      0.0
       ewr     -1.1
dtype: float64