我有像这样的DataFrame
OPEN HIGH LOW CLOSE VOL
2012-01-01 19:00:00 449000 449000 449000 449000 1336303000
2012-01-01 20:00:00 NaN NaN NaN NaN NaN
2012-01-01 21:00:00 NaN NaN NaN NaN NaN
2012-01-01 22:00:00 NaN NaN NaN NaN NaN
2012-01-01 23:00:00 NaN NaN NaN NaN NaN
...
OPEN HIGH LOW CLOSE VOL
2013-04-24 14:00:00 11700000 12000000 11600000 12000000 20647095439
2013-04-24 15:00:00 12000000 12399000 11979000 12399000 23997107870
2013-04-24 16:00:00 12399000 12400000 11865000 12100000 9379191474
2013-04-24 17:00:00 12300000 12397995 11850000 11850000 4281521826
2013-04-24 18:00:00 11850000 11850000 10903000 11800000 15546034128
我需要根据此规则填写NaN
当OPEN,HIGH,LOW,CLOSE为NaN时,
否则保持NaN
答案 0 :(得分:1)
由于其他两个答案都不起作用,这里有一个完整的答案。
我在这里测试了两种方法。第一个是基于working4coin对hd1的答案的评论,第二个是慢速,纯粹的python实现。很明显,python实现应该更慢,但我决定用两种方法来确定并量化结果。
def nans_to_prev_close_method1(data_frame):
data_frame['volume'] = data_frame['volume'].fillna(0.0) # volume should always be 0 (if there were no trades in this interval)
data_frame['close'] = data_frame.fillna(method='pad') # ie pull the last close into this close
# now copy the close that was pulled down from the last timestep into this row, across into o/h/l
data_frame['open'] = data_frame['open'].fillna(data_frame['close'])
data_frame['low'] = data_frame['low'].fillna(data_frame['close'])
data_frame['high'] = data_frame['high'].fillna(data_frame['close'])
方法1完成c中的大部分繁重工作(在熊猫代码中),因此应该非常快。
慢速,python方法(方法2)如下所示
def nans_to_prev_close_method2(data_frame):
prev_row = None
for index, row in data_frame.iterrows():
if np.isnan(row['open']): # row.isnull().any():
pclose = prev_row['close']
# assumes first row has no nulls!!
row['open'] = pclose
row['high'] = pclose
row['low'] = pclose
row['close'] = pclose
row['volume'] = 0.0
prev_row = row
测试两者的时间安排:
df = trades_to_ohlcv(PATH_TO_RAW_TRADES_CSV, '1s') # splits raw trades into secondly candles
df2 = df.copy()
wrapped1 = wrapper(nans_to_prev_close_method1, df)
wrapped2 = wrapper(nans_to_prev_close_method2, df2)
print("method 1: %.2f sec" % timeit.timeit(wrapped1, number=1))
print("method 2: %.2f sec" % timeit.timeit(wrapped2, number=1))
结果是:
method 1: 0.46 sec
method 2: 151.82 sec
显然,方法1要快得多(大约快330倍)。
答案 1 :(得分:0)
This说明pandas'缺少数据行为。你正在寻找的咒语是fillna方法,它取一个值:
In [1381]: df2
Out[1381]:
one two three four five timestamp
a NaN 1.138469 -2.400634 bar True NaT
c NaN 0.025653 -1.386071 bar False NaT
e 0.863937 0.252462 1.500571 bar True 2012-01-01 00:00:00
f 1.053202 -2.338595 -0.374279 bar True 2012-01-01 00:00:00
h NaN -1.157886 -0.551865 bar False NaT
In [1382]: df2.fillna(0)
Out[1382]:
one two three four five timestamp
a 0.000000 1.138469 -2.400634 bar True 1970-01-01 00:00:00
c 0.000000 0.025653 -1.386071 bar False 1970-01-01 00:00:00
e 0.863937 0.252462 1.500571 bar True 2012-01-01 00:00:00
f 1.053202 -2.338595 -0.374279 bar True 2012-01-01 00:00:00
h 0.000000 -1.157886 -0.551865 bar False 1970-01-01 00:00:00
你甚至可以向前和向后传播它们:
In [1384]: df
Out[1384]:
one two three
a NaN 1.138469 -2.400634
c NaN 0.025653 -1.386071
e 0.863937 0.252462 1.500571
f 1.053202 -2.338595 -0.374279
h NaN -1.157886 -0.551865
In [1385]: df.fillna(method='pad')
Out[1385]:
one two three
a NaN 1.138469 -2.400634
c NaN 0.025653 -1.386071
e 0.863937 0.252462 1.500571
f 1.053202 -2.338595 -0.374279
h 1.053202 -1.157886 -0.551865
对于您的具体情况,我认为您需要这样做:
df['VOL'].fillna(0)
df.fillna(df['CLOSE'])
答案 2 :(得分:0)
以下是通过屏蔽
的方法模拟带有一些孔的框架(A是您的“关闭”字段)
In [20]: df = DataFrame(randn(10,3),index=date_range('20130101',periods=10,freq='min'),
columns=list('ABC'))
In [21]: df.iloc[1:3,:] = np.nan
In [22]: df.iloc[5:8,1:3] = np.nan
In [23]: df
Out[23]:
A B C
2013-01-01 00:00:00 -0.486149 0.156894 -0.272362
2013-01-01 00:01:00 NaN NaN NaN
2013-01-01 00:02:00 NaN NaN NaN
2013-01-01 00:03:00 1.788240 -0.593195 0.059606
2013-01-01 00:04:00 1.097781 0.835491 -0.855468
2013-01-01 00:05:00 0.753991 NaN NaN
2013-01-01 00:06:00 -0.456790 NaN NaN
2013-01-01 00:07:00 -0.479704 NaN NaN
2013-01-01 00:08:00 1.332830 1.276571 -0.480007
2013-01-01 00:09:00 -0.759806 -0.815984 2.699401
我们都是Nan
In [24]: mask_0 = pd.isnull(df).all(axis=1)
In [25]: mask_0
Out[25]:
2013-01-01 00:00:00 False
2013-01-01 00:01:00 True
2013-01-01 00:02:00 True
2013-01-01 00:03:00 False
2013-01-01 00:04:00 False
2013-01-01 00:05:00 False
2013-01-01 00:06:00 False
2013-01-01 00:07:00 False
2013-01-01 00:08:00 False
2013-01-01 00:09:00 False
Freq: T, dtype: bool
我们想宣传A
In [26]: mask_fill = pd.isnull(df['B']) & pd.isnull(df['C'])
In [27]: mask_fill
Out[27]:
2013-01-01 00:00:00 False
2013-01-01 00:01:00 True
2013-01-01 00:02:00 True
2013-01-01 00:03:00 False
2013-01-01 00:04:00 False
2013-01-01 00:05:00 True
2013-01-01 00:06:00 True
2013-01-01 00:07:00 True
2013-01-01 00:08:00 False
2013-01-01 00:09:00 False
Freq: T, dtype: bool
首先传播
In [28]: df.loc[mask_fill,'C'] = df['A']
In [29]: df.loc[mask_fill,'B'] = df['A']
填写0的
In [30]: df.loc[mask_0] = 0
完成
In [31]: df
Out[31]:
A B C
2013-01-01 00:00:00 -0.486149 0.156894 -0.272362
2013-01-01 00:01:00 0.000000 0.000000 0.000000
2013-01-01 00:02:00 0.000000 0.000000 0.000000
2013-01-01 00:03:00 1.788240 -0.593195 0.059606
2013-01-01 00:04:00 1.097781 0.835491 -0.855468
2013-01-01 00:05:00 0.753991 0.753991 0.753991
2013-01-01 00:06:00 -0.456790 -0.456790 -0.456790
2013-01-01 00:07:00 -0.479704 -0.479704 -0.479704
2013-01-01 00:08:00 1.332830 1.276571 -0.480007
2013-01-01 00:09:00 -0.759806 -0.815984 2.699401