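I have a pandas Series of booleans and another Series of data values that shares the same index (monthly data starting a little after 1960). I am trying to create a DataFrame whose column names are the dates at which `boolean_array` is `True`, and whose columns contain the window of values from `idx-offset` to `idx+offset`. However, if the offset would cause an out-of-bounds error, I want to pad the column with `NaN`s.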
import numpy as np

df = get_data()                      # DataFrame with time series data
boolean_array = create_bool_array()  # time series of booleans
data_dict = {}
offset = 3
for val in np.where(boolean_array)[0]:
    idx = int(val)
    dt = df.index[idx]
    if idx - offset < 0:
        pass  # pad with (offset - idx) NaNs at the beginning of the column
    if idx + offset >= len(df):
        pass  # pad with (idx + offset + 1 - len(df)) NaNs at the end of the column
    # This is what I can use, assuming there are no out-of-bounds errors
    data_dict[dt] = df['data_column'][idx-offset:idx+offset+1].values
Is there a simple way to do this in pandas or numpy?
EDIT: Example input and output

df:                             boolean_array:
Date          data_column       Date          Value
---------------------------     -------------------
2013-01-01    55.0              2013-01-01    False
2013-02-01    57.0              2013-02-01    True
2013-03-01    52.0              2013-03-01    False
2013-04-01    56.0              2013-04-01    False
2013-05-01    59.0              2013-05-01    False
2013-06-01    61.0              2013-06-01    False
2013-07-01    63.0              2013-07-01    True
2013-08-01    66.0              2013-08-01    True
2013-09-01    67.0              2013-09-01    False
2013-10-01    67.0              2013-10-01    False
2013-11-01    69.0              2013-11-01    True
2013-12-01    70.0              2013-12-01    False
data_dict (output) with offset = 3
key: 2013-02-01, value: [NaN, NaN, 55.0, 57.0, 52.0, 56.0, 59.0]
key: 2013-07-01, value: [56.0, 59.0, 61.0, 63.0, 66.0, 67.0, 67.0]
key: 2013-08-01, value: [59.0, 61.0, 63.0, 66.0, 67.0, 67.0, 69.0]
key: 2013-11-01, value: [66.0, 67.0, 67.0, 69.0, 70.0, NaN, NaN]
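For reference, the expected output above can be reproduced with a short NumPy sketch: pad the value array with `offset` NaNs on each side, so that every window of length `2 * offset + 1` stays in bounds. The helper name `padded_windows` is hypothetical, and the sample data is retyped from the tables above; this is one possible approach, not necessarily the simplest.

```python
import numpy as np
import pandas as pd

def padded_windows(values, mask, offset):
    """Return {label: window} for each True entry in mask, NaN-padded at the edges."""
    # Pad with `offset` NaNs on both sides so no slice runs out of bounds.
    padded = np.pad(values.to_numpy(dtype=float), offset, constant_values=np.nan)
    result = {}
    for pos in np.flatnonzero(mask.to_numpy()):
        # Position `pos` in the original array sits at `pos + offset` in `padded`,
        # so the window [pos - offset, pos + offset] becomes [pos, pos + 2 * offset].
        result[values.index[pos]] = padded[pos:pos + 2 * offset + 1]
    return result

# Sample data copied from the question (month-start dates for 2013).
index = pd.date_range("2013-01-01", periods=12, freq="MS")
data = pd.Series(
    [55.0, 57.0, 52.0, 56.0, 59.0, 61.0, 63.0, 66.0, 67.0, 67.0, 69.0, 70.0],
    index=index,
)
flags = pd.Series(
    [False, True, False, False, False, False, True, True, False, False, True, False],
    index=index,
)

data_dict = padded_windows(data, flags, offset=3)
```

Passing `data_dict` to `pd.DataFrame` would then give one column per flagged date, since all windows have the same length.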
Answer 0 (score: 1)
You can use `concat` and set the axis to 1...
df = pd.concat([df, boolean_array], axis=1)
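As a small illustration of what this call does, `pd.concat` with `axis=1` aligns the two objects on their shared index and places them side by side as columns. The column name `flag` and the three-row sample are assumptions made for the example.

```python
import pandas as pd

index = pd.date_range("2013-01-01", periods=3, freq="MS")
df = pd.DataFrame({"data_column": [55.0, 57.0, 52.0]}, index=index)
boolean_array = pd.Series([False, True, False], index=index, name="flag")

# axis=1 joins on the index, so each date keeps its value and its flag together.
combined = pd.concat([df, boolean_array], axis=1)

# The flag column can then be used to select the dates of interest.
flagged_dates = combined.loc[combined["flag"]].index
```

Note that this only combines the two inputs; the windows around each flagged date still have to be extracted separately.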
Answer 1 (score: 1)
I don't know if this is the best way, but it works (Python 2.7)...
import pandas as pd
import numpy as np
from cStringIO import StringIO
PI_KWARGS = dict(freq='M', periods=7)
tseries_data = '''2013-01-01 55.0
2013-02-01 57.0
2013-03-01 52.0
2013-04-01 56.0
2013-05-01 59.0
2013-06-01 61.0
2013-07-01 63.0
2013-08-01 66.0
2013-09-01 67.0
2013-10-01 67.0
2013-11-01 69.0
2013-12-01 70.0'''
bool_col = '''2013-01-01 False
2013-02-01 True
2013-03-01 False
2013-04-01 False
2013-05-01 False
2013-06-01 False
2013-07-01 True
2013-08-01 True
2013-09-01 False
2013-10-01 False
2013-11-01 True
2013-12-01 False'''
df = pd.read_csv(StringIO(tseries_data), index_col=0, parse_dates=True, sep='\s+', header=None, names=['Date', 'Data'])
bools = pd.read_csv(StringIO(bool_col), index_col=0, parse_dates=True, sep='\s+', header=None, names=['Date', 'Data'])
dates = bools.where(bools).dropna().index
def make_back_datetime_index(time, nmonths=3, **pi_kwargs):
    month = time.month - nmonths % 12
    is_month = month > 0
    nyears = int(nmonths) / 12 + (not is_month)
    month = (is_month and month) or 12 + month
    start = pd.datetime(time.year - nyears, month, time.day)
    return pd.PeriodIndex(start=start, **pi_kwargs).to_datetime()
data_dict = {date : df.ix[make_back_datetime_index(date, **PI_KWARGS)] for date in dates}
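The code above relies on Python 2 and older pandas APIs (`cStringIO`, `pd.datetime`, `.ix`) that recent releases no longer provide. A rough equivalent for current pandas, sketched here with a different technique, is to build the window from shifted copies of the series: `Series.shift` fills positions that fall outside the data with NaN, so the padding comes for free. The function name `windows_by_shift` is hypothetical, and the sample data is retyped from the question.

```python
import pandas as pd

def windows_by_shift(values, mask, offset):
    """Build a DataFrame with one column per True date in mask.

    Rows are labelled by relative position, from -offset to +offset.
    """
    # shift(-k) moves the value from k steps later onto the current row;
    # positions shifted in from outside the series become NaN automatically.
    shifted = pd.concat(
        {k: values.shift(-k) for k in range(-offset, offset + 1)}, axis=1
    )
    # Keep only the flagged dates, then transpose so each date is a column.
    return shifted.loc[mask].T

index = pd.date_range("2013-01-01", periods=12, freq="MS")
data = pd.Series(
    [55.0, 57.0, 52.0, 56.0, 59.0, 61.0, 63.0, 66.0, 67.0, 67.0, 69.0, 70.0],
    index=index,
)
flags = pd.Series(
    [False, True, False, False, False, False, True, True, False, False, True, False],
    index=index,
)

result = windows_by_shift(data, flags, offset=3)
```

Here `result` has one column per flagged date, which matches the DataFrame layout the question asks for; `result.to_dict("list")` would give a dictionary keyed by date if that form is preferred.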