Pandas Panel from a dict of DataFrames returns NaNs

Time: 2016-03-01 00:10:03

Tags: python pandas dataframe panel nan

I have a set of DataFrames that I am trying to turn into a Panel. Here is my code:

# OPEN THE FILES INTO DATAFRAMES
filenames = ['Yahoo_2016-01-17.csv', 'Yahoo_2016-01-18.csv',
    'Yahoo_2016-01-19.csv','Yahoo_2016-01-23.csv','Yahoo_2016-01-27.csv',     
    'Yahoo_2016-02-05.csv', 'Yahoo_2016-02-06.csv', 'Yahoo_2016-02-09.csv',     
    'Yahoo_2016-02-11.csv', 'Yahoo_2016-02-13.csv', 'Yahoo_2016-02-15.csv', 
    'Yahoo_2016-02-16.csv', 'Yahoo_2016-02-29.csv']

dates = np.array(['2016-01-17', '2016-01-18', '2016-01-19', '2016-01-23', 
    '2016-01-27', '2016-02-05', '2016-02-06','2016-02-09', 
    '2016-02-11', '2016-02-13', '2016-02-15', '2016-02-16',
    '2016-02-29']).astype('datetime64[D]')

filepath = '/Users/RickS/Documents/Investing/Stock_files/GENERAL/'

dfs = [pd.read_csv(filepath+f) for f in filenames]

# Panel not working...
panel = pd.Panel(dict([(date, df) for date in dates for df in dfs]))
panel.swapaxes('major','minor')

But when I try to read the panel, all of the values in every DataFrame have turned into NaN:

[Screenshot: Data is NaNs]

When I look at the DataFrames individually, they all look fine. Here is one of the CSV files that gets imported into a df: example_csv_file

One thing to note, which may (or may not) matter, is that the dtypes are not exactly the same in every DataFrame:

In [24]: dfs[1].dtypes
Out[24]: 
Name                          object
Symbol                        object
Previous_Close               float64
Average_Daily_Volume           int64
Change_&_Percent_Change       object
Earnings/Share               float64
EPS_Estimate_Current_Year    float64
EPS_Estimate_Next_Quarter    float64
EPS_Estimate_Next_Year       float64
52-week_Low                  float64
52-week_High                 float64
EBITDA                        object
200-day_Moving_Average       float64
P/E_Ratio                    float64
PEG_Ratio                    float64
Short_Ratio                  float64
1_yr_Target_Price            float64
52-week_Range                 object
Date                          object
dtype: object

What am I doing wrong?

1 Answer:

Answer 0: (score: 1)

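The panel comes back empty, filled with NaNs, because your dates NumPy array is currently stored as datetime64. Apparently the pandas Panel object does not handle that type as the keys of the underlying dictionary.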

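Simply remove the astype call, or go a step further and use a plain list or tuple with the dates as string keys. Since the files are keyed by day, each dictionary key is still unique, which is all the panel needs. Either of the following works: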

dates = np.array(['2016-01-17', '2016-01-18', '2016-01-19', '2016-01-23', 
                  '2016-01-27', '2016-02-05', '2016-02-06','2016-02-09', 
                  '2016-02-11', '2016-02-13', '2016-02-15', '2016-02-16',
                  '2016-02-29'])

dates = ['2016-01-17', '2016-01-18', '2016-01-19', '2016-01-23', 
         '2016-01-27', '2016-02-05', '2016-02-06','2016-02-09', 
         '2016-02-11', '2016-02-13', '2016-02-15', '2016-02-16',
         '2016-02-29']

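However, this brings me to a second problem I found. As written, the list comprehension inside dict() produces a panel that contains only the last DataFrame, repeated 13 times. The reason is that the comprehension below returns every combination of the dfs list and the dates array, so its length equals the product of the two lengths: 13 x 13 = 169 pairs (i.e., a cross join or Cartesian product). The comprehension in question: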

[(date, df) for date in dates for df in dfs]

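Once dict() is applied to that list, each of the 13 unique dates ends up carrying the values of the last df. Later pairs overwrite earlier ones that share a key, and the last pairing for every date is the final DataFrame.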
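The overwrite is easy to reproduce without any CSV files. The following is a minimal sketch in plain Python; the three dates and the df_a, df_b, df_c strings are placeholders standing in for the real dates and DataFrames.

```python
# Placeholder stand-ins: plain strings instead of real DataFrames (assumption for illustration).
dates = ['2016-01-17', '2016-01-18', '2016-01-19']
dfs = ['df_a', 'df_b', 'df_c']

# Nested comprehension: every date is paired with every frame (3 x 3 = 9 pairs).
pairs = [(date, df) for date in dates for df in dfs]
print(len(pairs))

# dict() keeps only the last value seen for each key,
# so every date ends up mapped to the last frame in dfs.
crossed = dict(pairs)
print(crossed)
```

With 13 dates and 13 frames the same thing happens at a larger scale: 169 pairs collapse to 13 keys, all pointing at the last DataFrame.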

Consider using zip() to iterate over the two collections in step, so that each date is paired only with its own file:

dfDict = {}
for f,d in zip(filenames, dates):    
    dfDict[d] = pd.read_csv(filepath+f)    

panel = pd.Panel(dfDict)

Or, more concisely:

dfs = [pd.read_csv(filepath+f) for f in filenames] 
panel = pd.Panel(dict(zip(dates, dfs)))
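One caveat with zip() is that it silently stops at the shorter input, so a mismatch between the two hand-maintained lists would go unnoticed. A possible guard, sketched below, is to derive each date from its filename instead of keeping a separate list. The helper name date_from_filename is hypothetical, and the sketch assumes every file follows the Yahoo_YYYY-MM-DD.csv pattern shown in the question.

```python
import os

def date_from_filename(filename):
    """Hypothetical helper: 'Yahoo_2016-01-17.csv' -> '2016-01-17'."""
    stem, _ext = os.path.splitext(os.path.basename(filename))
    # Everything after the first underscore is taken to be the date (assumed naming pattern).
    return stem.split('_', 1)[1]

# Shortened list for illustration; the real list has 13 entries.
filenames = ['Yahoo_2016-01-17.csv', 'Yahoo_2016-01-18.csv', 'Yahoo_2016-02-29.csv']
dates = [date_from_filename(f) for f in filenames]

# One (date, filename) pair per file. In the real script the value would be
# pd.read_csv(filepath + f) rather than the filename itself.
paired = dict(zip(dates, filenames))
print(paired)
```

Because the keys are computed from the filenames, the two sequences always have the same length and order.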