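Pad a pandas column with NaNs if the offset is too large

Time: 2014-07-25 15:03:58

Tags: python numpy pandas

I have a pandas Series of boolean values and another Series of data values that shares the same index (monthly data starting a little after 1960). I am trying to create a DataFrame whose column names are the dates at which boolean_array is True, and whose columns each contain the window of values from idx-offset to idx+offset. However, if the offset would cause an out-of-bounds error, I want to pad the column with NaNs.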

import numpy as np

df = get_data() # DataFrame w/ time series data
boolean_array = create_bool_array() # time series of booleans

data_dict = {}
offset = 3

for val in np.where(boolean_array == True)[0]:
    idx = int(val)
    dt = df.index[idx]
    if idx - offset < 0:
        pass  # TODO: pad w/ (offset - idx) NaNs at beginning of col
    if idx + offset >= len(df):
        pass  # TODO: pad w/ (idx + offset + 1) - len(df) NaNs at end of col

    # This is what I can use assuming there are no out-of-bounds errors
    # (the window runs from idx-offset to idx+offset inclusive)
    data_dict[dt] = df['data_column'][idx-offset:idx+offset+1].values

Is there a simple way to do this in pandas or numpy?

Edit: example input and output
df:                               boolean_array:
Date            data_column       Date          Value
---------------------------       -------------------
2013-01-01      55.0              2013-01-01    False
2013-02-01      57.0              2013-02-01    True
2013-03-01      52.0              2013-03-01    False
2013-04-01      56.0              2013-04-01    False
2013-05-01      59.0              2013-05-01    False
2013-06-01      61.0              2013-06-01    False
2013-07-01      63.0              2013-07-01    True
2013-08-01      66.0              2013-08-01    True
2013-09-01      67.0              2013-09-01    False
2013-10-01      67.0              2013-10-01    False
2013-11-01      69.0              2013-11-01    True
2013-12-01      70.0              2013-12-01    False

data_dict (output) with offset = 3
key: 2013-02-01, value: [NaN, NaN, 55.0, 57.0, 52.0, 56.0, 59.0]
key: 2013-07-01, value: [56.0, 59.0, 61.0, 63.0, 66.0, 67.0, 67.0]
key: 2013-08-01, value: [59.0, 61.0, 63.0, 66.0, 67.0, 67.0, 69.0]
key: 2013-11-01, value: [66.0, 67.0, 67.0, 69.0, 70.0, NaN, NaN]

2 answers:

Answer 0 (score: 1)

You can use concat and set the axis to 1...

df = pd.concat([df, boolean_array], axis=1)
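
The answer stops at the concat call. As a rough sketch of how the padded windows could then be built from the combined frame (this goes beyond what the answer itself states): after aligning the two objects with concat, one option is to switch to a positional index and let reindex supply NaN for positions that fall outside the data. The sample values are copied from the question; the column name "flag" is a hypothetical name chosen here, since an unnamed Series would otherwise end up as column 0.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
dates = pd.date_range("2013-01-01", periods=12, freq="MS")
df = pd.DataFrame(
    {"data_column": [55.0, 57.0, 52.0, 56.0, 59.0, 61.0,
                     63.0, 66.0, 67.0, 67.0, 69.0, 70.0]},
    index=dates,
)
boolean_array = pd.Series(
    [False, True, False, False, False, False,
     True, True, False, False, True, False],
    index=dates,
    name="flag",  # hypothetical column name, not from the question
)

# concat with axis=1 aligns both objects on their shared index.
combined = pd.concat([df, boolean_array], axis=1)

offset = 3
# Positional (0..n-1) view of the data, so windows can be requested by position.
values = combined["data_column"].reset_index(drop=True)

data_dict = {}
for pos in np.flatnonzero(combined["flag"].to_numpy()):
    window = list(range(pos - offset, pos + offset + 1))
    # reindex yields NaN for positions that do not exist
    # (negative positions or positions past the end).
    data_dict[combined.index[pos]] = values.reindex(window).to_numpy()
```

Because every window has the same length (2 * offset + 1), passing data_dict to pd.DataFrame should give one column per flagged date.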

Answer 1 (score: 1)

I don't know if this is the best way to do it, but it works (Python 2.7)...

import pandas as pd
import numpy as np
from cStringIO import StringIO


PI_KWARGS = dict(freq='M', periods=7)

tseries_data = '''2013-01-01      55.0
2013-02-01      57.0
2013-03-01      52.0
2013-04-01      56.0
2013-05-01      59.0
2013-06-01      61.0
2013-07-01      63.0
2013-08-01      66.0
2013-09-01      67.0
2013-10-01      67.0
2013-11-01      69.0
2013-12-01      70.0'''

bool_col = '''2013-01-01    False
 2013-02-01    True
 2013-03-01    False
 2013-04-01    False
 2013-05-01    False
 2013-06-01    False
 2013-07-01    True
 2013-08-01    True
 2013-09-01    False
 2013-10-01    False
 2013-11-01    True
 2013-12-01    False'''


# Raw strings keep the regex separator '\s+' from being read as a string escape.
df = pd.read_csv(StringIO(tseries_data), index_col=0, parse_dates=True, sep=r'\s+', header=None, names=['Date', 'Data'])
bools = pd.read_csv(StringIO(bool_col), index_col=0, parse_dates=True, sep=r'\s+', header=None, names=['Date', 'Data'])
dates = bools.where(bools).dropna().index

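def make_back_datetime_index(time, nmonths=3, **pi_kwargs):
    # Step back nmonths from `time`, wrapping into the previous year when needed.
    month = time.month - nmonths % 12
    is_month = month > 0

    # Floor division keeps the year an integer.
    nyears = int(nmonths) // 12 + (not is_month)
    month = month if is_month else 12 + month
    start = pd.datetime(time.year - nyears, month, time.day)

    # Monthly periods beginning at `start`, converted back to timestamps.
    return pd.PeriodIndex(start=start, **pi_kwargs).to_datetime()

# Looking up labels that are missing from df yields NaN rows, which gives the padding.
data_dict = {date : df.ix[make_back_datetime_index(date, **PI_KWARGS)] for date in dates}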
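
Several calls in the code above (cStringIO, pd.datetime, PeriodIndex(start=...) and .ix) are no longer available in Python 3 or current pandas releases. The following is a minimal sketch of the same label-based idea with current APIs, not the answerer's own code. It assumes the timestamps fall on the first of each month, as in the question's sample data, and the helper name make_window_index is made up for this illustration.

```python
import pandas as pd

# Sample data copied from the question.
dates = pd.date_range("2013-01-01", periods=12, freq="MS")
df = pd.DataFrame(
    {"Data": [55.0, 57.0, 52.0, 56.0, 59.0, 61.0,
              63.0, 66.0, 67.0, 67.0, 69.0, 70.0]},
    index=dates,
)
bools = pd.Series(
    [False, True, False, False, False, False,
     True, True, False, False, True, False],
    index=dates,
)


def make_window_index(time, nmonths=3):
    """Month-start timestamps from nmonths before `time` to nmonths after it.

    Hypothetical helper; assumes `time` is itself a month-start timestamp.
    """
    start = time - pd.DateOffset(months=nmonths)  # handles year rollover
    return pd.date_range(start=start, periods=2 * nmonths + 1, freq="MS")


# Dates at which the boolean series is True.
flagged = bools.index[bools.to_numpy()]

# reindex fills labels that are missing from df with NaN,
# so windows that run past either end of the data are padded.
data_dict = {date: df.reindex(make_window_index(date)) for date in flagged}
```

Unlike a positional slice, this approach depends on the index being regular: a missing month inside the data would also show up as NaN in the window.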