Multidimensional Pandas DataFrame

Time: 2017-05-26 09:22:11

Tags: python list pandas scikit-learn

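I have just started learning machine learning and Scikit. I have been watching a tutorial in which the instructor uses Quandl to get Google stock price data. From what I have researched, quandl.get returns a pandas DataFrame. What confuses me is that one line of code appears to add columns in the second dimension of the DataFrame, while on another line the instructor accesses the same column through the first dimension. How is that possible? What is happening with this DataFrame?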

df = quandl.get('WIKI/GOOGL')

df = df[['Adj. Open','Adj. High','Adj. Low','Adj. Close','Adj. Volume']]

df['HCL_PCT'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] # how is df['Adj. Open'] working?? Wasn't 'Adj. Open' added in the second dimension of the dataframe in the second line of the code above??

My goal is to learn TensorFlow, and I want to get a basic grasp of machine learning jargon and concepts before diving into it.

2 answers:

Answer 0 (score: 0)

I added df.head() after each step to print the output and show what the data looks like:

import quandl

#read data
df = quandl.get('WIKI/GOOGL')
print (df.head())
              Open    High     Low    Close      Volume  Ex-Dividend  \
Date                                                                   
2004-08-19  100.01  104.06   95.96  100.335  44659000.0          0.0   
2004-08-20  101.01  109.08  100.50  108.310  22834300.0          0.0   
2004-08-23  110.76  113.48  109.05  109.400  18256100.0          0.0   
2004-08-24  111.24  111.60  103.57  104.870  15247300.0          0.0   
2004-08-25  104.76  108.00  103.88  106.000   9188600.0          0.0   

            Split Ratio  Adj. Open  Adj. High   Adj. Low  Adj. Close  \
Date                                                                   
2004-08-19          1.0  50.159839  52.191109  48.128568   50.322842   
2004-08-20          1.0  50.661387  54.708881  50.405597   54.322689   
2004-08-23          1.0  55.551482  56.915693  54.693835   54.869377   
2004-08-24          1.0  55.792225  55.972783  51.945350   52.597363   
2004-08-25          1.0  52.542193  54.167209  52.100830   53.164113   

            Adj. Volume  
Date                     
2004-08-19   44659000.0  
2004-08-20   22834300.0  
2004-08-23   18256100.0  
2004-08-24   15247300.0  
2004-08-25    9188600.0  
#select data by columns (filter) and set order of columns 
df = df[['Adj. Open','Adj. High','Adj. Low','Adj. Close','Adj. Volume']]
print (df.head())
            Adj. Open  Adj. High   Adj. Low  Adj. Close  Adj. Volume
Date                                                                
2004-08-19  50.159839  52.191109  48.128568   50.322842   44659000.0
2004-08-20  50.661387  54.708881  50.405597   54.322689   22834300.0
2004-08-23  55.551482  56.915693  54.693835   54.869377   18256100.0
2004-08-24  55.792225  55.972783  51.945350   52.597363   15247300.0
2004-08-25  52.542193  54.167209  52.100830   53.164113    9188600.0

#compute a new column from existing columns - each df['...'] selects one column by label
df['HCL_PCT'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open']
print (df.head())
            Adj. Open  Adj. High   Adj. Low  Adj. Close  Adj. Volume   HCL_PCT
Date                                                                          
2004-08-19  50.159839  52.191109  48.128568   50.322842   44659000.0  0.003250
2004-08-20  50.661387  54.708881  50.405597   54.322689   22834300.0  0.072270
2004-08-23  55.551482  56.915693  54.693835   54.869377   18256100.0 -0.012279
2004-08-24  55.792225  55.972783  51.945350   52.597363   15247300.0 -0.057264
2004-08-25  52.542193  54.167209  52.100830   53.164113    9188600.0  0.011837
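
To make the arithmetic in that step concrete, here is a minimal sketch that rebuilds only the first two rows shown above by hand (so it does not need Quandl) and checks the first HCL_PCT value against a manual calculation. The variable name first_row is just for illustration.

```python
import pandas as pd

# First two rows of the adjusted prices shown above.
df = pd.DataFrame(
    {'Adj. Open': [50.159839, 50.661387],
     'Adj. Close': [50.322842, 54.322689]},
    index=pd.to_datetime(['2004-08-19', '2004-08-20']),
)

# Series arithmetic is element-wise and aligned on the index,
# so each row gets (close - open) / open for that date.
df['HCL_PCT'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open']

# Same calculation done by hand for the first row only.
first_row = (50.322842 - 50.159839) / 50.159839

print(df)
print(first_row)
```

Each df['...'] on the right-hand side is a whole column (a Series), and assigning to df['HCL_PCT'] creates the new column from the result.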

Select the column Adj. Close:

print (df['Adj. Close'])
Date
2004-08-19     50.322842
2004-08-20     54.322689
2004-08-23     54.869377
2004-08-24     52.597363
2004-08-25     53.164113
2004-08-26     54.122070
2004-08-27     53.239345
2004-08-30     51.162935
2004-08-31     51.343492
2004-09-01     50.280210
2004-09-02     50.912161
2004-09-03     50.159839
2004-09-07     50.947269
2004-09-08     51.308384
2004-09-09     51.313400
2004-09-10     52.828075
2004-09-13     53.916435
2004-09-14     55.917612
2004-09-15     56.173402
2004-09-16     57.161452
2004-09-17     58.926902
2004-09-20     59.864797
2004-09-21     59.102444
2004-09-22     59.373280
2004-09-23     60.597057
2004-09-24     60.100525
2004-09-27     59.313094
2004-09-28     63.626409
2004-09-29     65.742942
2004-09-30     65.000651

2017-04-13    840.180000
2017-04-17    855.130000
2017-04-18    853.990000
2017-04-19    856.510000
2017-04-20    860.080000
2017-04-21    858.950000
2017-04-24    878.930000
2017-04-25    888.840000
2017-04-26    889.140000
2017-04-27    891.440000
2017-04-28    924.520000
2017-05-01    932.820000
2017-05-02    937.090000
2017-05-03    948.450000
2017-05-04    954.720000
2017-05-05    950.280000
2017-05-08    958.690000
2017-05-09    956.710000
2017-05-10    954.840000
2017-05-11    955.890000
2017-05-12    955.140000
2017-05-15    959.220000
2017-05-16    964.610000
2017-05-17    942.170000
2017-05-18    950.500000
2017-05-19    954.650000
2017-05-22    964.070000
2017-05-23    970.550000
2017-05-24    977.610000
2017-05-25    991.860000
Name: Adj. Close, Length: 3215, dtype: float64
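
The double brackets in the question are not a second dimension. The outer [] is the indexing operator and the inner [...] is an ordinary Python list of column labels. A small sketch with made-up values (not the Quandl data) shows the difference between passing one label and passing a list of labels:

```python
import pandas as pd

# Stand-in data; the numbers are invented for illustration.
df = pd.DataFrame({
    'Adj. Open': [50.0, 52.0, 55.0],
    'Adj. Close': [51.0, 51.0, 56.0],
    'Adj. Volume': [100.0, 200.0, 300.0],
})

cols = ['Adj. Open', 'Adj. Close']  # a plain list of labels
subset = df[cols]                   # identical to df[['Adj. Open', 'Adj. Close']]
single = df['Adj. Open']            # one label selects one column

print(type(subset).__name__, subset.ndim, subset.shape)
print(type(single).__name__, single.ndim, single.shape)
```

A list of labels returns a two-dimensional DataFrame containing just those columns, while a single label returns a one-dimensional Series. In both cases the columns are looked up by name.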

Edit:

import pandas as pd

df = pd.DataFrame({'A':[1,2,3],
                   'D':[4,5,6],
                   'B':[7,8,9],
                   'F':[1,3,5],
                   'C':[5,3,6]})

print (df)
   A  B  C  D  F
0  1  7  5  4  1
1  2  8  3  5  3
2  3  9  6  6  5

#select only columns A,B,C and return a new dataframe with the columns in that order
df1 = df[['A','B','C']]
print (df1)
   A  B  C
0  1  7  5
1  2  8  3
2  3  9  6

#select the same columns in a different order (C,A,B) and return a new dataframe
df2 = df[['C','A','B']]
print (df2)
   C  A  B
0  5  1  7
1  3  2  8
2  6  3  9
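
One caveat about the printed output above: it shows the columns sorted alphabetically (A B C D F), which is how older pandas versions handled a dict. Recent pandas versions on Python 3.6+ keep the dict's insertion order instead, so the first print may look different there. The sketch below assumes a recent pandas and shows that selecting with a list of labels gives whatever order you ask for, without changing the original frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'D': [4, 5, 6],
                   'B': [7, 8, 9],
                   'F': [1, 3, 5],
                   'C': [5, 3, 6]})

# Column order as stored; depends on the pandas version (see note above).
print(list(df.columns))

# Ask for an explicit order by passing a list of labels.
df_sorted = df[sorted(df.columns)]
df2 = df[['C', 'A', 'B']]

print(list(df_sorted.columns))
print(list(df2.columns))

# The original frame still has all five columns.
print(df.shape)
```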

Answer 1 (score: 0)

index : Index or array-like

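In a DataFrame, indexing with [] fetches columns. Passing an array or list of labels selects several columns at once, which is conceptually like df[:, [...]] in array notation (all rows, a selection of columns). Note that df[:, [...]] is not valid pandas syntax as written; the pandas spelling is df.loc[:, [...]].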
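
As a rough check of that equivalence, the sketch below compares plain bracket selection with the explicit .loc (label-based) and .iloc (position-based) forms on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [7, 8, 9],
                   'C': [5, 3, 6]})

by_brackets = df[['A', 'C']]      # list of labels inside []
by_loc = df.loc[:, ['A', 'C']]    # all rows, columns chosen by label
by_iloc = df.iloc[:, [0, 2]]      # all rows, columns chosen by position

# DataFrame.equals compares values, dtypes, index and column labels.
print(by_brackets.equals(by_loc))
print(by_brackets.equals(by_iloc))
```

All three return the same two-column DataFrame. The short df[[...]] form is simply the most common way to write it.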