So I have a dictionary of dataframes, stocks.
I pull up a stock's dataframe by indexing with its ticker:
stocks['OPK']
brings up the dataframe for 'OPK'.
The output is:
stocks['OPK']
Open High Low Close Volume Adj Close
Date
2010-01-04 1.80 1.97 1.76 1.95 234500.0 1.95
2010-01-05 1.64 1.95 1.64 1.93 135800.0 1.93
2010-01-06 1.90 1.92 1.77 1.79 546600.0 1.79
2010-01-07 1.79 1.94 1.76 1.92 138700.0 1.92
Edit: I've added the code that builds the same Panel I'm working with, so anyone trying to solve my problem won't run into issues testing their ideas.
Here is the code to get the Panel (for reproducibility)
import numpy as np
import pandas as pd
import pandas_datareader.data as web
import matplotlib.pyplot as plt
import datetime as dt
import re
startDate = '2010-01-01'
endDate = '2016-09-07'
stocks_query = ['AAPL','OPK']
stocks = web.DataReader(stocks_query, data_source='yahoo',
                        start=startDate, end=endDate)
stocks = stocks.swapaxes('items', 'minor_axis')
which produces this output:
Dimensions: 2 (items) x 1682 (major_axis) x 6 (minor_axis)
Items axis: AAPL to OPK
Major_axis axis: 2010-01-04 00:00:00 to 2016-09-07 00:00:00
Minor_axis axis: Open to Adj Close
I add custom columns through a function and then save the Panel to a pickle. After adding the columns, when I print the dataframe I see no problems. However, when I save it to a pickle and load it back, two of the six newly created columns end up with missing values. I want to be able to save it as a pickle so I don't have to keep recreating the columns, but I also want to do it through a function, since I want the columns created automatically.
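For context, here's a toy sketch of the round-trip I expect to work (made-up numbers, and 'toy.pkl' is just a scratch file name):

```python
import pandas as pd

# toy frame standing in for one stock's data (made-up numbers)
df = pd.DataFrame({'Close': [1.95, 1.93, 1.79, 1.92]},
                  index=pd.date_range('2010-01-04', periods=4))

# a derived boolean column, analogous to the columns that go missing
df['up'] = df['Close'].diff() > 0

df.to_pickle('toy.pkl')             # save
loaded = pd.read_pickle('toy.pkl')  # load it back

print(loaded['up'].tolist())  # [False, False, False, True]
```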
Here is my code (I've removed some parts for brevity):
import numpy as np
import pandas as pd
import pandas_datareader.data as web
import matplotlib.pyplot as plt
import datetime as dt
import re
startDate = '2010-01-01'
endDate = dt.date.today()
stocks_query = ('AAPL','OPK')
source = 'yahoo'
columns = ['Open', 'High', 'Low', ..., 'p_changed']
def load_data(stocks_query, data_source, start, end):
    file_extension = '_'.join(stocks_query)
    stocks = pd.read_pickle('C:\\Users\Moondra\MachineLearning\\Stock_Market_Predictor-master\{}.pkl'.format(file_extension))
    try:
        # this checks whether the customized columns have been added
        stocks[stocks_query[0]]['log_return']
    except KeyError:
        print('There was an error, so we are adding the columns')
        stocks = new_columns(stocks, columns)  # calls the function that adds the columns
        stocks.to_pickle('C:\\Users\Moondra\MachineLearning\Stock_Market_Predictor-master\{}.pkl'.format(file_extension))  # saves to a pickle file
    return stocks
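The caching idea in load_data (check for a sentinel column, rebuild and re-pickle only when it's missing) can be sketched generically; load_or_build, cache_path and build are placeholder names I made up for illustration:

```python
import os
import pandas as pd

def load_or_build(cache_path, build):
    """Return pickled data if the sentinel column exists, else rebuild and re-cache."""
    if os.path.exists(cache_path):
        data = pd.read_pickle(cache_path)
        if 'log_return' in data.columns:  # sentinel: were the custom columns added?
            return data
    data = build()                # recompute the columns
    data.to_pickle(cache_path)    # cache for next time
    return data

df = load_or_build('cache.pkl',
                   lambda: pd.DataFrame({'log_return': [0.1, -0.2]}))
print(list(df.columns))  # ['log_return']
```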
def new_columns(stocks, columns):  # this is the function that adds new columns
    stocks = stocks.reindex_axis(columns, axis='minor_axis')
    for i in stocks:
        stocks[i]['log_return'] = np.log(stocks[i]['Close'] / stocks[i]['Close'].shift(1))
        stocks[i]['close_open'] = stocks[i].Open - stocks[i].Close.shift(1)
        stocks[i]['30_Avg_Vol'] = stocks[i]['Volume'].rolling(min_periods=15, window=30).mean()
        stocks[i]['changed'] = stocks[i]['close_open'] * stocks[i]['close_open'].shift(-1) < 0
        stocks[i]['p_changed'] = (stocks[i]['close_open'] + stocks[i]['close_open'].shift(-1) < stocks[i]['close_open'].shift(-1)) \
            & (stocks[i]['close_open'] * stocks[i]['close_open'].shift(-1) < 0)
    return stocks
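To make the 'changed' logic concrete: the product of two consecutive close_open values is negative exactly when the sign flips between them. A tiny sketch with made-up numbers:

```python
import pandas as pd

# made-up close_open values
close_open = pd.Series([0.5, -0.3, -0.1, 0.2])

# True where this value and the next have opposite signs;
# the trailing NaN from shift(-1) compares as False
changed = close_open * close_open.shift(-1) < 0
print(changed.tolist())  # [True, False, True, False]
```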
The problem I'm having is with the last two columns.
After running the code and typing stocks['OPK'],
I don't run into any problems.
I can see that all the columns have been added, along with their values.
The last columns are a bit different in that they hold boolean values, but nothing looks abnormal.
Here is what my output looks like (no errors):
Date changed p_changed
2010-01-04 False False
2010-01-05 False False
2010-01-06 False False
2010-01-07 False False
2010-01-08 False False
2010-01-11 True False
2010-01-12 False False
2010-01-13 False False
However, when I load the pickle (note that in the load_data
function I save to the pickle right after adding the columns) and type stocks['OPK'], the last two columns show only NaN values.
changed p_changed
Date
2010-01-04 NaN NaN
2010-01-05 NaN NaN
2010-01-06 NaN NaN
2010-01-07 NaN NaN
Not sure why this is happening. The other columns I added, log_return
etc., show no errors. It's just the last two columns, the boolean ones.
I suspect that has something to do with it.
Edit: I also tried saving to the pickle outside of the function, but this strange "NaN" output still persists.
Answer 0 (score: 1)
Mate, here's a workaround. A pandas Panel
is more of a hassle than a help.
Use this code to convert your stocks
data into an ordinary multi-indexed pandas DataFrame, and see whether things work.
#use this to convert your Panel into a multi-indexed pd.DataFrame
stocks_df = pd.concat([stocks[item] for item in stocks.items], keys=stocks.items)

#a new_columns function (note that it's different from yours)
def new_columns(df):  # this is the function that adds new columns
    df.loc[:, 'log_return'] = np.log(df['Close'] / df['Close'].shift(1))
    df.loc[:, 'close_open'] = df.Open - df.Close.shift(1)
    df.loc[:, '30_Avg_Vol'] = df.loc[:, 'Volume'].rolling(min_periods=15, window=30).mean()
    df.loc[:, 'changed'] = df['close_open'] * df['close_open'].shift(-1) < 0
    df.loc[:, 'p_changed'] = (df['close_open'] + df['close_open'].shift(-1) < df['close_open'].shift(-1)) & (df['close_open'] * df['close_open'].shift(-1) < 0)
    return df

#here's how you would run it:
stocks_df = stocks_df.groupby(level=0).apply(new_columns)

#now I pickle it:
stocks_df.to_pickle("pickled_df.pkl")

#here I retrieve it:
stocks_read = pd.read_pickle("pickled_df.pkl")
In [41]: stocks_read.head()
Out[41]:
Open High Low Close Volume \
Date
AAPL 2010-01-04 213.429998 214.499996 212.380001 214.009998 123432400.0
2010-01-05 214.599998 215.589994 213.249994 214.379993 150476200.0
2010-01-06 214.379993 215.230000 210.750004 210.969995 138040000.0
2010-01-07 211.750000 212.000006 209.050005 210.580000 119282800.0
2010-01-08 210.299994 212.000006 209.060005 211.980005 111902700.0
Adj Close log_return close_open 30_Avg_Vol changed \
Date
AAPL 2010-01-04 27.727039 NaN NaN NaN False
2010-01-05 27.774976 0.001727 0.590000 NaN False
2010-01-06 27.333178 -0.016034 0.000000 NaN False
2010-01-07 27.282650 -0.001850 0.780005 NaN True
2010-01-08 27.464034 0.006626 -0.280006 NaN True
p_changed
Date
AAPL 2010-01-04 False
2010-01-05 False
2010-01-06 False
2010-01-07 False
2010-01-08 True
See, if you don't use Panel, things work like a charm.
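If you want to convince yourself that the booleans survive, here's a quick check on a toy multi-indexed frame (made-up tickers and values, 'check.pkl' is a scratch file):

```python
import pandas as pd

frames = {'AAPL': pd.DataFrame({'Close': [214.0, 214.4]}),
          'OPK': pd.DataFrame({'Close': [1.95, 1.93]})}

# same pd.concat-with-keys trick as above, on toy data
combined = pd.concat([frames[k] for k in frames], keys=list(frames))
combined['changed'] = combined['Close'].diff() > 0

combined.to_pickle('check.pkl')
loaded = pd.read_pickle('check.pkl')

print(loaded['changed'].dtype)              # bool
print(int(loaded['changed'].isna().sum()))  # 0 -- no NaNs appear
```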