Cross operations between two DataFrames

Time: 2019-01-05 11:24:12

Tags: python-2.7 dataframe outer-join quantitative-finance

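Two fairly large DataFrames, df1 and df2, each have many columns of floating-point values. For the index labels and column names that df1 and df2 share, I want to create two new DataFrames, simple_ret_df and log_ret_df, using the following operations:

  1. (1 - df1 / df2). Call the resulting DataFrame simple_ret_df.
  2. ln(df1 / df2). Call the resulting DataFrame log_ret_df.

If an entry in df1 or df2 is missing, NaN, or 0, the corresponding entry in the computed DataFrame should be NaN. Sample DataFrames df1 and df2 can be generated as follows: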

import numpy as np
import pandas as pd
df1 = pd.DataFrame(10*(2+np.random.randn(500, 3)), columns=list('ABC'))
df2 = pd.DataFrame(10*(2+np.random.randn(500, 3)), columns=list('CDA'))
df1.drop(df1.index[[1, 4, 284, 354, 498]], inplace=True)
df2.drop(df2.index[[0, 98, 159]], inplace=True)
df1.loc[2, 'B'] = np.nan
df1.loc[5, 'C'] = np.nan
df1.loc[3, 'A'] = np.nan
df2.loc[5, 'C'] = np.nan
df2.loc[1, 'D'] = np.nan
df2.loc[2, 'A'] = np.nan

Samples of the input DataFrames df1 and df2 look like this:

df1.head()
      A     B     C 
0  14.0  31.3  35.5
2  24.2   NaN  27.6
3   NaN  13.1  16.0
5  28.2   8.8   NaN
6  17.7  18.0   7.9
df2.head()
      C     D     A
1  15.1   NaN  27.0
2  20.9  29.4   NaN
3  27.8  29.7  22.9
4  19.0  13.5  21.0
5   NaN  21.4  12.0

The corresponding sample output DataFrames simple_ret_df and log_ret_df are:

simple_ret_df.head(6)
          A    B        C     D
0       NaN  NaN      NaN   NaN
1       NaN  NaN      NaN   NaN
2       NaN  NaN  -0.3206   NaN
3       NaN  NaN   0.4245   NaN
4       NaN  NaN      NaN   NaN
5   -0.4750  NaN      NaN   NaN

log_ret_df.head(6)
          A    B        C     D
0       NaN  NaN      NaN   NaN
1       NaN  NaN      NaN   NaN
2       NaN  NaN   0.2781   NaN
3       NaN  NaN  -0.5524   NaN
4       NaN  NaN      NaN   NaN
5    0.3887  NaN      NaN   NaN
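As a quick check on the tables above, the entry at index 2, column C can be reproduced by hand from the sample inputs (27.6 in df1 and 20.9 in df2). The sketch below uses plain floats rather than DataFrames; the variable names are only illustrative.

```python
import math

# Values read from the sample tables: index 2, column C.
df1_val = 27.6  # df1.loc[2, "C"]
df2_val = 20.9  # df2.loc[2, "C"]

ratio = df1_val / df2_val
simple_ret = 1 - ratio     # matches simple_ret_df.loc[2, "C"] after rounding
log_ret = math.log(ratio)  # matches log_ret_df.loc[2, "C"] after rounding

print(round(simple_ret, 4), round(log_ret, 4))
```

Because simple_ret = 1 - ratio, the two results are linked by log_ret = ln(1 - simple_ret), which is a convenient sanity check between the two output tables.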

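2 answers:

Answer 0 (score: 0)

I partly answered this in the comments. Here is a solution for you. I wrote it in Python 3, while your tag is python 2, so you may need to change some of the code.

Here is the full code from scratch.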

import numpy as np
import pandas as pd
df1 = pd.DataFrame(10*(2+np.random.randn(500, 3)), columns=list('ABC'))
df2 = pd.DataFrame(10*(2+np.random.randn(500, 3)), columns=list('CDA'))
df1.drop(df1.index[[1, 4, 284, 354, 498]], inplace=True)
df2.drop(df2.index[[0, 98, 159]], inplace=True)
df1.loc[2, 'B'] = np.nan
df1.loc[5, 'C'] = np.nan
df1.loc[3, 'A'] = np.nan
df2.loc[5, 'C'] = np.nan
df2.loc[1, 'D'] = np.nan
df2.loc[2, 'A'] = np.nan

df1.head(10)

            A          B          C
0   20.438695  18.114421  20.445370
2   12.789906        NaN   3.988319
3         NaN  11.026463   9.421921
5   19.919580   7.462012        NaN
6   21.290647  23.952295 -10.354758
7    8.447708  14.710224  25.499204
8   16.603850  24.862611  20.354342
9    6.088232  15.066117  22.906491
10  19.621493  15.877428  15.149765
11   2.052592   9.031476  19.531663

df2.head(10)

            C          D          A
1   15.159127        NaN  29.432163
2   23.449304   8.393440        NaN
3    6.057011  32.881258  21.033391
4   31.162671   4.128745  23.264304
5         NaN  32.796018  11.171984
6   32.019817  14.603303  33.106655
7   17.566806  26.804403  12.421421
8   30.121336  46.520462  41.934098
9   13.498463  30.170049  24.221281
10  19.554489  28.238385  26.284620


# Merge the dataframes on their index; overlapping column names get _x / _y suffixes
merged_df = df1.merge(df2, right_index=True, left_index=True, how='outer')
# Create dataframes for each formula
final_df1 = pd.DataFrame(index=merged_df.index)  # for (1 - df1 / df2)
final_df2 = pd.DataFrame(index=merged_df.index)  # for ln(df1 / df2)
for column in merged_df.columns:
    # Get the initial letter from the column name
    i = column[0]
    # Keep only the columns that start with the same letter
    df_test = merged_df[merged_df.columns[merged_df.columns.str.startswith(i)]]
    # If there is only one column, add it to the dataframes unchanged
    if df_test.shape[1] == 1:
        final_df1 = final_df1.merge(df_test, right_index=True, left_index=True, how='outer')
        final_df2 = final_df2.merge(df_test, right_index=True, left_index=True, how='outer')
    # If there are two columns, do the calculations
    else:
        left = df_test[df_test.columns[0]]
        right = df_test[df_test.columns[1]]
        either_missing = left.isnull() | right.isnull()
        # Note: as written this evaluates (1 - left) / right, not 1 - left / right;
        # the output shown below reflects the expression as written
        final_df1.loc[:, i] = np.where(either_missing, np.nan, (1 - left) / right)
        final_df2.loc[:, i] = np.where(either_missing, np.nan, np.log(left / right))

# Adjust the names of the columns
final_df1.columns = final_df1.columns.str[0]
final_df2.columns = final_df2.columns.str[0]

print(final_df1.head(10), final_df2.head(10))

This gives:

simple_ret_df.head(6)

           A          B         C          D
0       NaN  18.114421       NaN        NaN
1       NaN        NaN       NaN        NaN
2       NaN        NaN -0.127437   8.393440
3       NaN  11.026463 -1.390442  32.881258
4       NaN        NaN       NaN   4.128745
5 -1.693484   7.462012       NaN  32.796018 

log_ret_df.head(6)
           A          B         C          D
0       NaN  18.114421       NaN        NaN
1       NaN        NaN       NaN        NaN
2       NaN        NaN -1.771471   8.393440
3       NaN  11.026463  0.441823  32.881258
4       NaN        NaN       NaN   4.128745
5  0.578294   7.462012       NaN  32.796018
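Two things in the loop above differ from what the question asks for: the first expression is (1 - a) / b rather than 1 - a / b, and columns that exist in only one frame (B and D) are copied through instead of being NaN. A possible variant of the same merge-based idea is sketched below. The helper name returns_from_merge, the suffixes "_1"/"_2", and the toy frames are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

def returns_from_merge(df1, df2):
    """Sketch: outer-merge on the index, then compute both return frames."""
    merged = df1.merge(df2, left_index=True, right_index=True,
                       how="outer", suffixes=("_1", "_2"))
    all_cols = sorted(set(df1.columns) | set(df2.columns))
    # Start from all-NaN frames so columns present in one input only stay NaN.
    simple = pd.DataFrame(np.nan, index=merged.index, columns=all_cols)
    log = simple.copy()
    for col in sorted(set(df1.columns) & set(df2.columns)):
        a = merged[col + "_1"]
        b = merged[col + "_2"]
        # Treat zeros as missing, as the question requires.
        ratio = a.where(a != 0) / b.where(b != 0)
        simple[col] = 1 - ratio
        log[col] = np.log(ratio)
    return simple, log

# Hypothetical toy inputs with partially overlapping labels.
toy1 = pd.DataFrame({"A": [2.0, 3.0, 0.0], "B": [1.0, 1.0, 1.0]}, index=[0, 1, 2])
toy2 = pd.DataFrame({"A": [4.0, 5.0, 6.0], "D": [1.0, 1.0, 1.0]}, index=[1, 2, 3])
toy_simple, toy_log = returns_from_merge(toy1, toy2)
print(toy_simple)
print(toy_log)
```

Only the entry at index 1, column A has valid values in both toy frames, so it is the only non-NaN cell in either result. With real data, a negative ratio makes np.log return NaN and emit a RuntimeWarning.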

Answer 1 (score: 0)

simple_ret_df = df1.combine(df2, lambda s1, s2: 1-s1/s2)
log_ret_df = df1.combine(df2, lambda s1, s2: np.log(s1/s2))
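These one-liners align the two frames on the union of index and column labels, so labels missing from either side come out as NaN. They do not, however, convert zeros to NaN. A minimal sketch of one way to add that, assuming zeros should simply be masked before combining (the toy frames and the names d1/d2 are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames with partially overlapping labels and one zero.
df1 = pd.DataFrame({"A": [2.0, 3.0, 0.0], "B": [1.0, 1.0, 1.0]}, index=[0, 1, 2])
df2 = pd.DataFrame({"A": [4.0, 5.0, 6.0], "D": [1.0, 1.0, 1.0]}, index=[1, 2, 3])

# Replace zeros with NaN so they propagate as missing values.
d1 = df1.where(df1 != 0)
d2 = df2.where(df2 != 0)

simple_ret_df = d1.combine(d2, lambda s1, s2: 1 - s1 / s2)
log_ret_df = d1.combine(d2, lambda s1, s2: np.log(s1 / s2))

print(simple_ret_df)
print(log_ret_df)
```

In this toy case only index 1, column A is present and non-zero in both inputs, so it is the only cell with a value in either result.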