在熊猫中选择部分数据时遇到问题

时间:2019-11-21 13:00:14

标签: python pandas datetime average

我有以下数据库是从csv文件中用熊猫提取的:

df1=pd.read_csv(path,parse_dates=True) 

df1的打印结果为:

                  control      Avg_return  
2019-09-07          True            0    
2019-06-06          True            0
2019-02-19          True            0
2019-01-17          True            0
2018-12-20          True            0
2018-11-27          True            0
2018-10-12          True            0
   ...              ...            ...

加载2个csv文件后

df2=pd.read_csv(path,parse_dates=True)

df2的打印结果为:

                  return
2010-01-01          NaN
2010-04-01     0.010920
2010-05-01    -0.004404
2010-06-01    -0.025209
2010-07-01    -0.023280
   ...            ...

我的代码的目标是:

  1. 从df1开始约会
  2. 从第1点算起的日期减去6天。
  3. 从第1点算起的日期减去244天。
  4. 在df2中获取这两个日期以来的所有回报
  5. 计算这些回报的平均值,并将其存储在Avg_return中

我这样做了:

for i in range(0,df1_row):                   
#I go through my data df1 

    if (control.iloc[i]==True):                        
#I check if control_1 is true

      date_1=df1.index[i]-pd.to_timedelta(6, unit='d')    
# I remove 6 days from my date

      date_2=df1.index[i]-pd.to_timedelta(244, unit='d')  
# I remove 244 days from my date

      df1.loc[i,"Average_return"] = df2[[date_1:date_2],["return"]].mean()

# I want to make the mean of the return between my date-6 days and my date-244 days

不幸的是,它给了我这个错误:

df1.loc[i,"Average_return"] = df2[[date1:date2],["return"]].mean()
                                        ^
SyntaxError: invalid syntax

有人可以帮助我吗? :)

3 个答案:

答案 0 :(得分:0)

这是一种不遍历所有行的方法:

# make sure your index is a datetime index
df1.index = pd.to_datetime(df1.index)    

df1['date_1'] = df1.index - pd.to_timedelta(6, unit='d') 
df1['date_2'] = df1.index  - pd.to_timedelta(244, unit='d') 

df1['Average_return'] = df1.apply(lambda r: df2.loc[r['date_1']: r['date_2'], 'return'].mean(), axis=1)

答案 1 :(得分:0)

以下内容看起来有些丑陋,但我认为它可以工作:)

虚拟df:

import numpy as np
import pandas as pd    

cols = ['date', 'control', 'Avg_return']
data = [
    [pd.to_datetime('2019-09-07'), True, 0], 
    [pd.to_datetime('2019-06-06'), True, 0]
]
df1 = pd.DataFrame(data, columns=cols)

cols2 = ['date', 'return']
data2 = [
    [pd.to_datetime('2010-01-01'), np.nan], 
    [pd.to_datetime('2010-04-01'), 0.010920], 
    [pd.to_datetime('2019-09-01'), 1]
]
df2 = pd.DataFrame(data2, columns=cols2)

起草的解决方案:

df1['date_minus_6'] = df1['date'] - dt.timedelta(days=6)
df1['date_minus_244'] = df1['date'] - dt.timedelta(days=244)

for i in range(0, df1.shape[0]):
    for j in range(0, df2.shape[0]):
        if df2['date'].iloc[j] == df1['date_minus_6'].iloc[i]:
               df1['Avg_return'].iloc[i] = (
                   df1['Avg_return'].iloc[i] + df2['return'].iloc[j]
               ).mean()
        elif df2['date'].iloc[j] == df1['date_minus_244'].iloc[i]:
               df1['Avg_return'].iloc[i] = (
                   df1['Avg_return'].iloc[i] + df2['return'].iloc[j]
               ).mean()

输出:

    date    control Avg_return  date_minus_6    date_minus_244
0   2019-09-07  True    1.0 2019-09-01  2019-01-06
1   2019-06-06  True    0.0 2019-05-31  2018-10-05

答案 2 :(得分:0)

import csv
import pandas as pd

df1=pd.read_csv('dsf1.csv',parse_dates=True)
df2=pd.read_csv('dsf2.csv',parse_dates=True)

df1.columns = ['date', 'control', 'return']
df2.columns = ['date', 'return']

df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])
for i in range(0, df1.shape[0]):
    if df1['control'][i] == True:
        date_1 = df1['date'][0] - pd.to_timedelta(6, unit='d')
        date_2 = df2['date'][0] - pd.to_timedelta(244, unit='d')
        #I'm not sure if average_return has the correct condition, but adjust as you see fit
        df1.loc[i, 'average_return'] = (df1[df1['date'] > date_1]['return'] -  df2[df2['date'] > date_2]['return']).mean()
print df1