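I have a DataFrame and want to add new columns to it based on the values of specific columns, where the result depends on data held in another DataFrame.
More specifically, I have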
df_original =
    Crncy  Spread  Duration
0     EUR     100       1.2
1     nan     nan       nan
2             100      3.46
3     CHF     200       2.5
4     USD      50       5.0
...
df_interpolation =
    CRNCY  TENOR  Adj_EUR  Adj_USD
0     EUR      1       10       20
1     EUR      2       20       30
2     EUR      5       30       40
3     EUR      7       40       50
...
10    CHF      1       50       10
11    CHF      2       60       20
12    CHF      5       70       30
...
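and I now want to add to df_original, for each row, a linear interpolation of the Adj_EUR and Adj_USD values, based on the Crncy and Duration columns.
So, for each available Crncy, we want to use TENOR and Adj_EUR / Adj_USD from df_interpolation, together with Duration from df_original, to form the interpolation.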
For example, in pseudocode using the optimize package from scipy:
from scipy import optimize

# Do this for both 'Adj_EUR' and 'Adj_USD'; shown here for 'Adj_EUR'
for curr, df in df_original.groupby('Crncy'):
    curve = df_interpolation[df_interpolation['CRNCY'] == curr]
    x_data = curve['TENOR'].to_numpy()
    y_data = curve['Adj_EUR'].to_numpy()
    # Linear fit
    z_linear = optimize.curve_fit(lambda t, a, b: a + b * t, x_data, y_data)[0]
    # Somehow add the values back to df_original in a new column
    df['Adj_EUR'] = z_linear[0] + z_linear[1] * df['Duration']
Any pointers on how to do this?
Answer 0: (score: 1)
Suppose we have df1 and df2:
>>> df1
  Crncy  Spread  Duration
0   EUR     100       1.2
1   CHF     200       2.5
>>> df2
  CRNCY  TENOR  Adj_EUR  Adj_USD
0   EUR      1       10       20
1   EUR      2       20       30
2   EUR      5       30       40
3   EUR      7       40       50
4   CHF      1       50       10
5   CHF      2       60       20
6   CHF      5       70       30
Convert df1 and df2 into DataFrames with the same columns:
import numpy as np
import pandas as pd

df1['Adj_EUR'] = np.nan
df1['Adj_USD'] = np.nan
df1['left'] = 1
>>> df1
  Crncy  Spread  Duration  Adj_EUR  Adj_USD  left
0   EUR     100       1.2      NaN      NaN     1
1   CHF     200       2.5      NaN      NaN     1
df2 = df2.rename(columns={'CRNCY': 'Crncy', 'TENOR': 'Duration'})
df2['Spread'] = np.nan
df2['left'] = 0
>>> df2
  Crncy  Duration  Adj_EUR  Adj_USD  Spread  left
0   EUR         1       10       20     NaN     0
1   EUR         2       20       30     NaN     0
2   EUR         5       30       40     NaN     0
3   EUR         7       40       50     NaN     0
4   CHF         1       50       10     NaN     0
5   CHF         2       60       20     NaN     0
6   CHF         5       70       30     NaN     0
Now concatenate df1 and df2 row-wise and sort by Crncy and Duration:
df3 = pd.concat([df1, df2], ignore_index=True, sort=False).sort_values(['Crncy', 'Duration'])
>>> df3
  Crncy  Spread  Duration  Adj_EUR  Adj_USD  left
6   CHF     NaN       1.0     50.0     10.0     0
7   CHF     NaN       2.0     60.0     20.0     0
1   CHF   200.0       2.5      NaN      NaN     1
8   CHF     NaN       5.0     70.0     30.0     0
2   EUR     NaN       1.0     10.0     20.0     0
0   EUR   100.0       1.2      NaN      NaN     1
3   EUR     NaN       2.0     20.0     30.0     0
4   EUR     NaN       5.0     30.0     40.0     0
5   EUR     NaN       7.0     40.0     50.0     0
Then interpolate the NaN values in each column using Duration, and drop the columns that are no longer needed.
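The code for this last step is not shown, so here is a minimal sketch of one way it might look. It is an assumption, not the answerer's code: it rebuilds df3 from the sample data above, interpolates each currency group against Duration with `interpolate(method="index")` (the default `method="linear"` would treat rows as equally spaced), and then keeps only the rows flagged `left == 1`. The names `parts` and `result` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the combined frame from the steps above (sample data from the answer).
df1 = pd.DataFrame({"Crncy": ["EUR", "CHF"], "Spread": [100, 200], "Duration": [1.2, 2.5]})
df2 = pd.DataFrame({
    "Crncy": ["EUR"] * 4 + ["CHF"] * 3,
    "Duration": [1, 2, 5, 7, 1, 2, 5],
    "Adj_EUR": [10, 20, 30, 40, 50, 60, 70],
    "Adj_USD": [20, 30, 40, 50, 10, 20, 30],
})
df1 = df1.assign(Adj_EUR=np.nan, Adj_USD=np.nan, left=1)
df2 = df2.assign(Spread=np.nan, left=0)
df3 = pd.concat([df1, df2], ignore_index=True, sort=False).sort_values(["Crncy", "Duration"])

adj_cols = ["Adj_EUR", "Adj_USD"]
parts = []
for _, group in df3.groupby("Crncy"):
    filled = group.copy()
    # Use Duration as the x-axis so the gaps are filled by distance along Duration,
    # not by row position (assumption: rows are already sorted by Duration).
    by_duration = group.set_index("Duration")[adj_cols].interpolate(method="index")
    filled[adj_cols] = by_duration.to_numpy()
    parts.append(filled)

# Keep only the rows that came from df1 and drop the helper column.
result = pd.concat(parts)
result = result[result["left"] == 1].drop(columns="left").sort_index()
print(result)
```

With the sample data, the EUR row at Duration 1.2 falls between tenors 1 and 2, so its Adj_EUR lands between 10 and 20. This sketch does not extrapolate beyond the tenor range.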
Hope this helps.
Answer 1: (score: 0)
So, this is what I was looking for:
import pandas as pd
from scipy import optimize

temp_interpolation_lst = []
for curr, df in df_original.groupby('Crncy'):
    curve = df_interpolation[df_interpolation['CRNCY'] == curr]
    x_data = curve['TENOR'].to_numpy(dtype=float)
    y_data_usd = curve['Adj_USD'].to_numpy(dtype=float)
    y_data_eur = curve['Adj_EUR'].to_numpy(dtype=float)
    # Linear fit; skipped for currencies with no rows in df_interpolation
    if x_data.size > 0:
        z_linear_usd = optimize.curve_fit(lambda t, a, b: a + b * t, x_data, y_data_usd)[0]
        z_linear_eur = optimize.curve_fit(lambda t, a, b: a + b * t, x_data, y_data_eur)[0]
        temp_df = df[['Crncy', 'Duration']].copy()
        temp_df['Adj_USD'] = z_linear_usd[0] + z_linear_usd[1] * temp_df['Duration']
        temp_df['Adj_EUR'] = z_linear_eur[0] + z_linear_eur[1] * temp_df['Duration']
        temp_interpolation_lst.append(temp_df)
temp_interpolation_df = pd.concat(temp_interpolation_lst)
temp_interpolation_df.sort_index(axis=0, inplace=True)
# Add back to the original DataFrame; the indices match, so the join aligns row by row
df_original = df_original.join(other=temp_interpolation_df[['Adj_USD', 'Adj_EUR']], how='left')
It's not as clean as I'd hoped, but it works...
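The same per-currency straight-line fit can probably be written a little more compactly. The sketch below is a variant under stated assumptions, not the code above: it swaps `optimize.curve_fit` for `np.polyfit` with `deg=1` (also a least-squares line), rebuilds small sample frames from the question's data, and writes the results back with a boolean mask instead of a join. The name `coefs` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample frames modelled on the question's data (assumed, abbreviated).
df_original = pd.DataFrame({
    "Crncy": ["EUR", np.nan, "CHF", "USD"],
    "Spread": [100, np.nan, 200, 50],
    "Duration": [1.2, np.nan, 2.5, 5.0],
})
df_interpolation = pd.DataFrame({
    "CRNCY": ["EUR"] * 4 + ["CHF"] * 3,
    "TENOR": [1, 2, 5, 7, 1, 2, 5],
    "Adj_EUR": [10, 20, 30, 40, 50, 60, 70],
    "Adj_USD": [20, 30, 40, 50, 10, 20, 30],
})

adj_cols = ("Adj_EUR", "Adj_USD")

# One least-squares line (slope, intercept) per currency and per adjustment column.
coefs = {
    curr: {col: np.polyfit(grp["TENOR"].to_numpy(dtype=float),
                           grp[col].to_numpy(dtype=float), deg=1)
           for col in adj_cols}
    for curr, grp in df_interpolation.groupby("CRNCY")
}

for col in adj_cols:
    # Rows whose currency has no curve (USD, NaN) keep NaN.
    df_original[col] = np.nan
    for curr, fit in coefs.items():
        mask = df_original["Crncy"] == curr
        durations = df_original.loc[mask, "Duration"].to_numpy(dtype=float)
        df_original.loc[mask, col] = np.polyval(fit[col], durations)

print(df_original)
```

Note that this is a fitted line through all tenors of a currency, as in the answer above, rather than a piecewise interpolation between neighbouring tenors, so the two approaches give different numbers.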