Printed DataFrame differs after df.to_pickle: values become NaN

Time: 2017-03-23 19:13:24

Tags: pandas dataframe pickle nan

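So I have a dictionary of DataFrames called stocks. I pull up a stock's DataFrame by passing in its ticker symbol.

stocks['OPK'] pulls up the stock 'OPK'. The output is: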

stocks['OPK']
            Open  High   Low  Close      Volume  Adj Close  
Date                                                                     
2010-01-04  1.80  1.97  1.76   1.95    234500.0       1.95          
2010-01-05  1.64  1.95  1.64   1.93    135800.0       1.93     
2010-01-06  1.90  1.92  1.77   1.79    546600.0       1.79
2010-01-07  1.79  1.94  1.76   1.92    138700.0       1.92     

Edit: I have added code to build the same Panel I am working with, so anyone trying to solve my problem won't run into trouble when testing their ideas.

Here is the code to get the Panel (for reproducibility)

import numpy as np
import pandas as pd
import pandas_datareader.data as web
import matplotlib.pyplot as plt
import datetime as dt
import re



startDate = '2010-01-01'
endDate = '2016-09-07'  
stocks_query = ['AAPL','OPK']


stocks = web.DataReader(stocks_query, data_source='yahoo',
          start=startDate, end=endDate)
stocks = stocks.swapaxes('items','minor_axis')

which produces the output:

Dimensions: 2 (items) x 1682 (major_axis) x 6 (minor_axis)
Items axis: AAPL to OPK
Major_axis axis: 2010-01-04 00:00:00 to 2016-09-07 00:00:00
Minor_axis axis: Open to Adj Close

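I add custom columns through a function and then save the result to a pickle. After adding the columns, I see no problem when I print the DataFrame. However, when I save it to a pickle and load it back, two of the six newly created columns end up with missing values. I want to be able to store it as a pickle so I don't have to keep recreating the columns. But I also want to do it through a function, because I want the columns to be created automatically.

Here is my code (I removed some parts for brevity):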

import numpy as np
import pandas as pd
import pandas_datareader.data as web
import matplotlib.pyplot as plt
import datetime as dt
import re

startDate = '2010-01-01'
endDate = dt.date.today()
stocks_query = ('AAPL','OPK')
source = 'yahoo'
columns =['Open', 'High', 'Low'......'p_changed']


def load_data(stocks_query, data_source, start, end):
    file_extension = '_'.join(stocks_query)
    # raw string so the backslashes in the Windows path are taken literally
    pickle_path = r'C:\Users\Moondra\MachineLearning\Stock_Market_Predictor-master\{}.pkl'.format(file_extension)

    stocks = pd.read_pickle(pickle_path)
    try:
        # this checks if the customized columns have been added
        stocks[stocks_query[0]]['log_return']
    except KeyError:
        print('There was an error, so we are adding the columns')
        stocks = new_columns(stocks, columns)  # calls the function to add the columns
        stocks.to_pickle(pickle_path)  # saves to a pickle file
    return stocks



def new_columns(stocks, columns):   # this is the function that adds new columns
    stocks = stocks.reindex_axis([columns], 'minor_axis')
    for i in stocks:
        stocks[i]['log_return'] = np.log(stocks[i]['Close'] / stocks[i]['Close'].shift(1))
        stocks[i]['close_open'] = stocks[i].Open - stocks[i].Close.shift(1)

        stocks[i]['30_Avg_Vol'] = stocks[i]['Volume'].rolling(min_periods=15, window=30).mean()

        stocks[i]['changed'] = stocks[i]['close_open'] * stocks[i]['close_open'].shift(-1) < 0
        stocks[i]['p_changed'] = (stocks[i]['close_open'] + stocks[i]['close_open'].shift(-1) < stocks[i]['close_open'].shift(-1)) \
                                 & (stocks[i]['close_open'] * stocks[i]['close_open'].shift(-1) < 0)

    return stocks

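The problem I'm running into is with the last two columns. After running the code and typing stocks['OPK'], I don't see any problems. All the columns have been added, along with their values. The last two columns are a bit different because they hold boolean values, but nothing looks wrong.

Here is what my output looks like (no errors):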

Date                  changed p_changed                  
2010-01-04            False     False  
2010-01-05            False     False  
2010-01-06            False     False  
2010-01-07            False     False  
2010-01-08            False     False  
2010-01-11            True     False  
2010-01-12            False     False  
2010-01-13            False     False  

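However, when I load the pickle (note that in the load_data function I save to the pickle right after adding the columns) and type stocks['OPK'], the last two columns show only NaN values.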

                     changed  p_changed  
Date                                                        
2010-01-04           NaN           NaN        
2010-01-05           NaN           NaN       
2010-01-06           NaN           NaN      
2010-01-07           NaN           NaN    

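I'm not sure why this happens. The other columns I added, log_return and so on, come back fine. It is only the last two columns, which are boolean, so I suspect that has something to do with it.

Edit: I also tried saving to the pickle outside the function, but the strange "NaN" output stays the same.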
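One way to narrow this down is to compare dtypes and values before and after the round trip. Below is a minimal sketch on a plain DataFrame, not on a Panel. The prices are made up, and the helper name roundtrip_report and the temporary file are illustrative assumptions rather than part of the original code:

```python
import os
import tempfile

import pandas as pd


def roundtrip_report(df):
    """Pickle df to a temporary file, read it back, and compare.

    Hypothetical helper for illustration: returns the reloaded frame and a
    list of columns whose dtype or values changed during the round trip.
    """
    fd, path = tempfile.mkstemp(suffix=".pkl")
    os.close(fd)
    try:
        df.to_pickle(path)
        loaded = pd.read_pickle(path)
    finally:
        os.remove(path)

    # Series.equals treats NaNs in the same positions as equal
    changed_cols = [
        col for col in df.columns
        if df[col].dtype != loaded[col].dtype or not df[col].equals(loaded[col])
    ]
    return loaded, changed_cols


# Made-up prices, only to exercise the check
dates = pd.date_range("2010-01-04", periods=5, freq="D")
frame = pd.DataFrame(
    {"Open": [1.80, 1.64, 1.90, 1.85, 1.95],
     "Close": [1.95, 1.93, 1.79, 1.92, 1.90]},
    index=dates,
)
frame["close_open"] = frame["Open"] - frame["Close"].shift(1)
frame["changed"] = frame["close_open"] * frame["close_open"].shift(-1) < 0

loaded, changed_cols = roundtrip_report(frame)
print(frame.dtypes)
print("columns that changed in the round trip:", changed_cols)
```

If the same check reports no changed columns for a plain DataFrame but the Panel still comes back with NaN, that would point at how the columns were assigned into the Panel rather than at pickle itself.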

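1 Answer:

Answer 0 (score: 1)

Mate, here is a workaround. Pandas Panel is more of a hassle than a help.

Use this code to convert your stocks data into an ordinary multi-indexed pandas DataFrame and see whether things work.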

#use this to convert your Panel into multi-indexed pd.DataFrame
stocks_df = pd.concat([stocks[item] for item in  stocks.items],keys = stocks.items)

#a new_columns function (note that it's different from yours)
def new_columns(df):   #this is the function that adds new columns
        df.loc[:,'log_return'] = np.log(df['Close']/(df['Close'].shift(1)))
        df.loc[:,'close_open'] = (df.Open - df.Close.shift(1))

        df.loc[:,'30_Avg_Vol'] = df.loc[:,'Volume'].rolling(min_periods =15, window=30).mean()

        df.loc[:,'changed'] = df['close_open'] * df['close_open'].shift(-1) < 0
        df.loc[:,'p_changed'] = (df['close_open'] + df['close_open'].shift(-1) < df['close_open'].shift(-1))  & (df['close_open']* df['close_open'].shift(-1) < 0)
        return(df)


#here's how you would run it:
stocks_df = stocks_df.groupby(level=0).apply(new_columns)

#now I pickle it:
stocks_df.to_pickle("pickled_df.pkl")

#here I retrieve it.           
stocks_read = pd.read_pickle("pickled_df.pkl")    

In [41]: stocks_read.head()
Out[41]:
                       Open        High         Low       Close       Volume  \
     Date
AAPL 2010-01-04  213.429998  214.499996  212.380001  214.009998  123432400.0
     2010-01-05  214.599998  215.589994  213.249994  214.379993  150476200.0
     2010-01-06  214.379993  215.230000  210.750004  210.969995  138040000.0
     2010-01-07  211.750000  212.000006  209.050005  210.580000  119282800.0
     2010-01-08  210.299994  212.000006  209.060005  211.980005  111902700.0

                 Adj Close  log_return  close_open  30_Avg_Vol changed  \
     Date
AAPL 2010-01-04  27.727039         NaN         NaN         NaN   False
     2010-01-05  27.774976    0.001727    0.590000         NaN   False
     2010-01-06  27.333178   -0.016034    0.000000         NaN   False
     2010-01-07  27.282650   -0.001850    0.780005         NaN    True
     2010-01-08  27.464034    0.006626   -0.280006         NaN    True

                p_changed
     Date
AAPL 2010-01-04     False
     2010-01-05     False
     2010-01-06     False
     2010-01-07     False
     2010-01-08      True

See, if you don't use Panel, things work like a charm.
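For anyone who wants to try the idea without downloading data, here is a small self-contained sketch of the same approach. The prices are made up, and it uses per-ticker groupby(...).shift(...) instead of groupby(...).apply(...), which is a substitution on my part and not the code above:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up prices for two tickers; real data would come from your own source
dates = pd.date_range("2010-01-04", periods=4, freq="D")
raw = {
    "AAPL": pd.DataFrame({"Open": [10.0, 10.5, 10.2, 10.9],
                          "Close": [10.4, 10.3, 10.8, 10.7]}, index=dates),
    "OPK": pd.DataFrame({"Open": [1.80, 1.64, 1.90, 1.79],
                         "Close": [1.95, 1.93, 1.79, 1.92]}, index=dates),
}

# Stack the per-ticker frames into one frame indexed by (ticker, date)
stocks_df = pd.concat(raw)

# Shift within each ticker so one stock's rows never leak into another's
prev_close = stocks_df.groupby(level=0)["Close"].shift(1)
stocks_df["log_return"] = np.log(stocks_df["Close"] / prev_close)
stocks_df["close_open"] = stocks_df["Open"] - prev_close
next_gap = stocks_df.groupby(level=0)["close_open"].shift(-1)
stocks_df["changed"] = stocks_df["close_open"] * next_gap < 0

# Round-trip through a pickle in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stocks_df.pkl")
    stocks_df.to_pickle(path)
    stocks_read = pd.read_pickle(path)

print(stocks_read.dtypes)
print(stocks_read.loc["OPK"])
```

Because every column lives in one ordinary DataFrame, the boolean column keeps its bool dtype, and the reloaded frame can be compared directly with stocks_read.equals(stocks_df).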