Can Python + pandas handle large amounts of data?

Time: 2013-08-25 18:11:50

Tags: python-2.7 numpy pandas

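I have the following code, which I downloaded from the web. I modified it by splitting the downloads into smaller batches so that it is easier on Yahoo.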

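My question is about the ability of Python in general, and Python + pandas in particular, to handle more data than I am attempting here. When I run this code and compute the correlations between all the symbols, it ends up choking (see the "Instead it shows this" section). If I remove some of the calculations, it seems fine. I am not sure what is choking; I think it is pandas?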

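What is the right way to break up this code so that it keeps its simplicity [vectorized operations rather than loops] and can still handle more data? I would like to be able to process 10 years of 1-minute data stored in files, and if it cannot even handle one year of daily data, it will not be able to run on that dataset.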

So my question is:

What is the right way to fix this program (hopefully in a way I can generalize) so that it can run on the DOW 30 symbols?

import pandas
from matplotlib.pyplot import show, legend
from datetime import datetime
from matplotlib import finance
import numpy

# 2011 to 2012
start = datetime(2011, 01, 01)
end = datetime(2012, 01, 01)

symbolsAK = ["AA", "AXP", "BA", "BAC", "CAT",
             "CSCO", "CVX", "DD", "DIS", "GE", "HD",
             "HPQ", "IBM", "INTC", "JNJ", "JPM",
             "KO"]
symbolsMP = ["MCD", "MMM", "MRK", "MSFT", "PFE", "PG"]
#symbolsTX = ["T", "TRV", "UNH", "UTX", "VZ", "WMT", "XOM"]

symbols = symbolsAK
symbols = symbols + symbolsMP
#symbols = symbols + symbolsTX

quotesAK = [finance.quotes_historical_yahoo(symbol, start, end, asobject=True)
            for symbol in symbolsAK]
quotesMP = [finance.quotes_historical_yahoo(symbol, start, end, asobject=True)
            for symbol in symbolsMP]
#quotesTX = [finance.quotes_historical_yahoo(symbol, start, end, asobject=True)
#            for symbol in symbolsTX]

quotes = quotesAK
quotes = quotes + quotesMP
#quotes = quotes + quotesTX

close = numpy.array([q.close for q in quotes]).astype(numpy.float)
dates = numpy.array([q.date for q in quotes])

data = {}

for i in xrange(len(symbols)):
   data[symbols[i]] = numpy.diff(numpy.log(close[i]))

df = pandas.DataFrame(data, index=dates[0][:-1], columns=symbols)


print df.corr()

The output should look something like this [shortened]:
#           AA       AXP        BA       BAC       CAT
#AA   1.000000  0.768484  0.758264  0.737625  0.837643
#AXP  0.768484  1.000000  0.746898  0.760043  0.736337
#BA   0.758264  0.746898  1.000000  0.657075  0.770696
#BAC  0.737625  0.760043  0.657075  1.000000  0.657113
#CAT  0.837643  0.736337  0.770696  0.657113  1.000000

Instead it shows this:

<class 'pandas.core.frame.DataFrame'>
Index: 23 entries, AA to PG
Data columns (total 23 columns):
AA      23  non-null values
AXP     23  non-null values
BA      23  non-null values
BAC     23  non-null values
CAT     23  non-null values
CSCO    23  non-null values
CVX     23  non-null values
DD      23  non-null values
DIS     23  non-null values
GE      23  non-null values
HD      23  non-null values
HPQ     23  non-null values
IBM     23  non-null values
INTC    23  non-null values
JNJ     23  non-null values
JPM     23  non-null values
KO      23  non-null values
MCD     23  non-null values
MMM     23  non-null values
MRK     23  non-null values
MSFT    23  non-null values
PFE     23  non-null values
PG      23  non-null values
dtypes: float64(23)

1 answer:

Answer 0 (score: 5)

I think this is not a memory or speed problem, but simply a matter of how pandas formats console output (see http://pandas.pydata.org/pandas-docs/stable/basics.html#working-with-package-options).

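If a DataFrame is too large to display in the console, pandas falls back to a summary view that lists how many non-null values each column contains. On my machine the limit is 20 columns and 60 rows, but you can change this setting to display larger DataFrames.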

This is the summary view you get:

In [2]: df.corr()
Out[2]: 
<class 'pandas.core.frame.DataFrame'>
Index: 30 entries, AA to XOM
Data columns (total 30 columns):
AA      30  non-null values
AXP     30  non-null values
BA      30  non-null values
BAC     30  non-null values
CAT     30  non-null values
CSCO    30  non-null values
CVX     30  non-null values
DD      30  non-null values
DIS     30  non-null values
GE      30  non-null values
HD      30  non-null values
HPQ     30  non-null values
IBM     30  non-null values
INTC    30  non-null values
JNJ     30  non-null values
JPM     30  non-null values
KO      30  non-null values
MCD     30  non-null values
MMM     30  non-null values
MRK     30  non-null values
MSFT    30  non-null values
PFE     30  non-null values
PG      30  non-null values
T       30  non-null values
TRV     30  non-null values
UNH     30  non-null values
UTX     30  non-null values
VZ      30  non-null values
WMT     30  non-null values
XOM     30  non-null values
dtypes: float64(30)
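
In the pandas versions of that time, this fallback is essentially what `DataFrame.info()` prints, so the same kind of per-column summary can be requested explicitly. A minimal sketch, assuming a recent pandas and a hypothetical random 30x30 frame in place of the real correlation table (the layout of the `info()` output differs between versions):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for df.corr(): 30 rows by 30 float64 columns.
rng = np.random.default_rng(2)
frame = pd.DataFrame(rng.normal(size=(30, 30)),
                     columns=["C%02d" % i for i in range(30)])

# info() writes a per-column summary (non-null counts and dtypes)
# to the given buffer instead of printing the values themselves.
buf = io.StringIO()
frame.info(buf=buf)
summary = buf.getvalue()
print(summary)
```

Capturing the text in a buffer is only a convenience for logging; calling `frame.info()` with no arguments prints straight to the console.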

You can change the number of columns to display:

In [5]: pandas.options.display.max_columns = 50
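
Outside an interactive session, the same option can be set with `pandas.set_option`, or only temporarily with `pandas.option_context`. The following is a minimal sketch, assuming a recent pandas; the random 30-column frame and the names `cols`, `corr`, and `text` are stand-ins for illustration, not the asker's data. Newer pandas versions truncate wide frames with `...` instead of printing the summary view, and the default limits may differ by version and environment.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the returns DataFrame: 250 rows, 30 columns.
rng = np.random.default_rng(0)
cols = ["S%02d" % i for i in range(30)]
df = pd.DataFrame(rng.normal(size=(250, 30)), columns=cols)
corr = df.corr()

# Global change, equivalent to pandas.options.display.max_columns = 50.
pd.set_option("display.max_columns", 50)

# Temporary change: the options revert when the with-block exits.
with pd.option_context("display.max_rows", 50,
                       "display.max_columns", 50,
                       "display.width", 200):
    text = repr(corr)  # rendered under the temporary display limits
    print(text)

# Restore the default for the globally changed option.
pd.reset_option("display.max_columns")
```

The context-manager form is useful in scripts, because it avoids changing how every later DataFrame is printed.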

Now the whole DataFrame is displayed:

In [6]: df.corr()
Out[6]: 
            AA       AXP        BA       BAC       CAT      CSCO       CVX  \
AA    1.000000  0.768692  0.758487  0.738168  0.838511  0.584911  0.785955   
AXP   0.768692  1.000000  0.746401  0.760255  0.736557  0.553068  0.703163   
BA    0.758487  0.746401  1.000000  0.657093  0.770767  0.540786  0.721736   
BAC   0.738168  0.760255  0.657093  1.000000  0.657254  0.518776  0.620971   
CAT   0.838511  0.736557  0.770767  0.657254  1.000000  0.572002  0.798452   
CSCO  0.584911  0.553068  0.540786  0.518776  0.572002  1.000000  0.577381   
CVX   0.785955  0.703163  0.721736  0.620971  0.798452  0.577381  1.000000   
DD    0.851112  0.759933  0.760574  0.675753  0.851070  0.582742  0.803719   
DIS   0.751486  0.742574  0.782171  0.660317  0.746241  0.554223  0.713484   
GE    0.765963  0.755788  0.752602  0.699929  0.724883  0.576092  0.741695   
HD    0.614588  0.647022  0.661556  0.575308  0.621294  0.499147  0.647491   
HPQ   0.595110  0.509375  0.573672  0.453443  0.594590  0.427063  0.487018   
IBM   0.662302  0.635524  0.701338  0.501511  0.664953  0.525319  0.625367   
INTC  0.610633  0.587957  0.633674  0.463658  0.634462  0.543521  0.580799   
JNJ   0.676678  0.674556  0.668016  0.569427  0.677017  0.562765  0.707574   
JPM   0.799386  0.803539  0.695899  0.843875  0.726496  0.583126  0.701845   
KO    0.632974  0.649398  0.686937  0.504384  0.621104  0.496410  0.684645   
MCD   0.590209  0.619359  0.608288  0.482579  0.557322  0.467640  0.584303   
MMM   0.807796  0.760495  0.760528  0.674455  0.805890  0.598801  0.771571   
MRK   0.683526  0.675286  0.683141  0.594330  0.630897  0.528784  0.669215   
MSFT  0.708997  0.670527  0.675077  0.579668  0.672689  0.629810  0.676658   
PFE   0.692207  0.661628  0.661427  0.580332  0.653362  0.524557  0.695621   
PG    0.533456  0.638269  0.634056  0.461198  0.569150  0.500971  0.628252   
T     0.662901  0.658365  0.647409  0.585394  0.611656  0.454966  0.659306   
TRV   0.697297  0.690480  0.693580  0.692810  0.679448  0.550598  0.707205   
UNH   0.626418  0.645823  0.644480  0.577014  0.642021  0.502656  0.628023   
UTX   0.800084  0.770001  0.818340  0.650226  0.844137  0.611440  0.779919   
VZ    0.613422  0.613442  0.576083  0.536747  0.589583  0.472622  0.627931   
WMT   0.517511  0.575717  0.587670  0.479790  0.538195  0.515317  0.556602   
XOM   0.747023  0.699433  0.734805  0.598516  0.753005  0.581742  0.905136   

            DD       DIS        GE        HD       HPQ       IBM      INTC  \
AA    0.851112  0.751486  0.765963  0.614588  0.595110  0.662302  0.610633   
AXP   0.759933  0.742574  0.755788  0.647022  0.509375  0.635524  0.587957   
BA    0.760574  0.782171  0.752602  0.661556  0.573672  0.701338  0.633674   
BAC   0.675753  0.660317  0.699929  0.575308  0.453443  0.501511  0.463658   
CAT   0.851070  0.746241  0.724883  0.621294  0.594590  0.664953  0.634462   
CSCO  0.582742  0.554223  0.576092  0.499147  0.427063  0.525319  0.543521   
CVX   0.803719  0.713484  0.741695  0.647491  0.487018  0.625367  0.580799   
DD    1.000000  0.773421  0.768493  0.660224  0.587773  0.674010  0.627005   
DIS   0.773421  1.000000  0.768324  0.643008  0.609767  0.678413  0.607358   
GE    0.768493  0.768324  1.000000  0.649000  0.553156  0.656494  0.625745   
HD    0.660224  0.643008  0.649000  1.000000  0.459635  0.575951  0.572010   
HPQ   0.587773  0.609767  0.553156  0.459635  1.000000  0.582698  0.548928   
IBM   0.674010  0.678413  0.656494  0.575951  0.582698  1.000000  0.633732   
INTC  0.627005  0.607358  0.625745  0.572010  0.548928  0.633732  1.000000   
JNJ   0.714763  0.654975  0.683914  0.589519  0.494923  0.602186  0.571545   
JPM   0.767345  0.737792  0.795344  0.601889  0.521005  0.602322  0.569887   
KO    0.696257  0.656332  0.674888  0.631668  0.443318  0.694586  0.574671   
MCD   0.583090  0.569733  0.556076  0.608105  0.337828  0.569540  0.491635   
MMM   0.799806  0.775277  0.797455  0.654009  0.578911  0.676061  0.650945   
MRK   0.671173  0.690316  0.687744  0.574417  0.448651  0.627232  0.547941   
MSFT  0.703819  0.684609  0.679975  0.631967  0.521019  0.682591  0.662063   
PFE   0.690313  0.650876  0.706027  0.638876  0.474586  0.623725  0.550615   
PG    0.617922  0.611371  0.613507  0.556490  0.431871  0.610044  0.551303   
T     0.686551  0.669819  0.680358  0.597554  0.494590  0.678023  0.545211   
TRV   0.710612  0.710623  0.677900  0.624701  0.482071  0.589566  0.608157   
UNH   0.640953  0.651940  0.632988  0.612200  0.407039  0.611192  0.547778   
UTX   0.815454  0.786531  0.777018  0.673500  0.610108  0.748190  0.692028   
VZ    0.630868  0.617529  0.684984  0.567786  0.424424  0.586035  0.508896   
WMT   0.566875  0.581024  0.556110  0.692174  0.374181  0.489173  0.489745   
XOM   0.774908  0.720534  0.761815  0.639149  0.523942  0.675966  0.610824   

           JNJ       JPM        KO       MCD       MMM       MRK      MSFT  \
AA    0.676678  0.799386  0.632974  0.590209  0.807796  0.683526  0.708997   
AXP   0.674556  0.803539  0.649398  0.619359  0.760495  0.675286  0.670527   
BA    0.668016  0.695899  0.686937  0.608288  0.760528  0.683141  0.675077   
BAC   0.569427  0.843875  0.504384  0.482579  0.674455  0.594330  0.579668   
CAT   0.677017  0.726496  0.621104  0.557322  0.805890  0.630897  0.672689   
CSCO  0.562765  0.583126  0.496410  0.467640  0.598801  0.528784  0.629810   
CVX   0.707574  0.701845  0.684645  0.584303  0.771571  0.669215  0.676658   
DD    0.714763  0.767345  0.696257  0.583090  0.799806  0.671173  0.703819   
DIS   0.654975  0.737792  0.656332  0.569733  0.775277  0.690316  0.684609   
GE    0.683914  0.795344  0.674888  0.556076  0.797455  0.687744  0.679975   
HD    0.589519  0.601889  0.631668  0.608105  0.654009  0.574417  0.631967   
HPQ   0.494923  0.521005  0.443318  0.337828  0.578911  0.448651  0.521019   
IBM   0.602186  0.602322  0.694586  0.569540  0.676061  0.627232  0.682591   
INTC  0.571545  0.569887  0.574671  0.491635  0.650945  0.547941  0.662063   
JNJ   1.000000  0.649433  0.661615  0.591725  0.736881  0.720435  0.606554   
JPM   0.649433  1.000000  0.584480  0.520379  0.764575  0.632774  0.665440   
KO    0.661615  0.584480  1.000000  0.659553  0.684177  0.685925  0.630570   
MCD   0.591725  0.520379  0.659553  1.000000  0.639054  0.610580  0.569149   
MMM   0.736881  0.764575  0.684177  0.639054  1.000000  0.688326  0.705497   
MRK   0.720435  0.632774  0.685925  0.610580  0.688326  1.000000  0.620179   
MSFT  0.606554  0.665440  0.630570  0.569149  0.705497  0.620179  1.000000   
PFE   0.710511  0.627674  0.630108  0.599965  0.687126  0.723702  0.620668   
PG    0.664540  0.593982  0.660393  0.566643  0.655894  0.646314  0.579561   
T     0.619650  0.661625  0.637338  0.555407  0.645148  0.642262  0.608858   
TRV   0.625928  0.728347  0.675313  0.598593  0.739503  0.654874  0.600154   
UNH   0.620315  0.593633  0.618663  0.534163  0.610730  0.611829  0.562731   
UTX   0.725406  0.718998  0.710645  0.624908  0.848424  0.694618  0.723456   
VZ    0.634423  0.606947  0.592759  0.522129  0.635813  0.620811  0.564451   
WMT   0.574580  0.552472  0.568968  0.571420  0.610972  0.571786  0.579684   
XOM   0.724311  0.712734  0.710473  0.567184  0.748141  0.699390  0.703494   

           PFE        PG         T       TRV       UNH       UTX        VZ  \
AA    0.692207  0.533456  0.662901  0.697297  0.626418  0.800084  0.613422   
AXP   0.661628  0.638269  0.658365  0.690480  0.645823  0.770001  0.613442   
BA    0.661427  0.634056  0.647409  0.693580  0.644480  0.818340  0.576083   
BAC   0.580332  0.461198  0.585394  0.692810  0.577014  0.650226  0.536747   
CAT   0.653362  0.569150  0.611656  0.679448  0.642021  0.844137  0.589583   
CSCO  0.524557  0.500971  0.454966  0.550598  0.502656  0.611440  0.472622   
CVX   0.695621  0.628252  0.659306  0.707205  0.628023  0.779919  0.627931   
DD    0.690313  0.617922  0.686551  0.710612  0.640953  0.815454  0.630868   
DIS   0.650876  0.611371  0.669819  0.710623  0.651940  0.786531  0.617529   
GE    0.706027  0.613507  0.680358  0.677900  0.632988  0.777018  0.684984   
HD    0.638876  0.556490  0.597554  0.624701  0.612200  0.673500  0.567786   
HPQ   0.474586  0.431871  0.494590  0.482071  0.407039  0.610108  0.424424   
IBM   0.623725  0.610044  0.678023  0.589566  0.611192  0.748190  0.586035   
INTC  0.550615  0.551303  0.545211  0.608157  0.547778  0.692028  0.508896   
JNJ   0.710511  0.664540  0.619650  0.625928  0.620315  0.725406  0.634423   
JPM   0.627674  0.593982  0.661625  0.728347  0.593633  0.718998  0.606947   
KO    0.630108  0.660393  0.637338  0.675313  0.618663  0.710645  0.592759   
MCD   0.599965  0.566643  0.555407  0.598593  0.534163  0.624908  0.522129   
MMM   0.687126  0.655894  0.645148  0.739503  0.610730  0.848424  0.635813   
MRK   0.723702  0.646314  0.642262  0.654874  0.611829  0.694618  0.620811   
MSFT  0.620668  0.579561  0.608858  0.600154  0.562731  0.723456  0.564451   
PFE   1.000000  0.576964  0.597129  0.642421  0.590014  0.675389  0.628915   
PG    0.576964  1.000000  0.668227  0.607292  0.492360  0.677481  0.591762   
T     0.597129  0.668227  1.000000  0.657551  0.604891  0.648988  0.756705   
TRV   0.642421  0.607292  0.657551  1.000000  0.665523  0.683029  0.587940   
UNH   0.590014  0.492360  0.604891  0.665523  1.000000  0.660746  0.486421   
UTX   0.675389  0.677481  0.648988  0.683029  0.660746  1.000000  0.605494   
VZ    0.628915  0.591762  0.756705  0.587940  0.486421  0.605494  1.000000   
WMT   0.552283  0.618861  0.529654  0.619793  0.499349  0.601957  0.549769   
XOM   0.715801  0.666614  0.692532  0.706332  0.654499  0.776531  0.617919   

           WMT       XOM  
AA    0.517511  0.747023  
AXP   0.575717  0.699433  
BA    0.587670  0.734805  
BAC   0.479790  0.598516  
CAT   0.538195  0.753005  
CSCO  0.515317  0.581742  
CVX   0.556602  0.905136  
DD    0.566875  0.774908  
DIS   0.581024  0.720534  
GE    0.556110  0.761815  
HD    0.692174  0.639149  
HPQ   0.374181  0.523942  
IBM   0.489173  0.675966  
INTC  0.489745  0.610824  
JNJ   0.574580  0.724311  
JPM   0.552472  0.712734  
KO    0.568968  0.710473  
MCD   0.571420  0.567184  
MMM   0.610972  0.748141  
MRK   0.571786  0.699390  
MSFT  0.579684  0.703494  
PFE   0.552283  0.715801  
PG    0.618861  0.666614  
T     0.529654  0.692532  
TRV   0.619793  0.706332  
UNH   0.499349  0.654499  
UTX   0.601957  0.776531  
VZ    0.549769  0.617919  
WMT   1.000000  0.550944  
XOM   0.550944  1.000000  

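Alternatively, you can select just part of the data to look at. In the code below I use ix to slice out the first 10 rows and first 10 columns of the table:
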
In [7]: df.corr().ix[0:10,0:10]
Out[7]: 
            AA       AXP        BA       BAC       CAT      CSCO       CVX  \
AA    1.000000  0.768692  0.758487  0.738168  0.838511  0.584911  0.785955   
AXP   0.768692  1.000000  0.746401  0.760255  0.736557  0.553068  0.703163   
BA    0.758487  0.746401  1.000000  0.657093  0.770767  0.540786  0.721736   
BAC   0.738168  0.760255  0.657093  1.000000  0.657254  0.518776  0.620971   
CAT   0.838511  0.736557  0.770767  0.657254  1.000000  0.572002  0.798452   
CSCO  0.584911  0.553068  0.540786  0.518776  0.572002  1.000000  0.577381   
CVX   0.785955  0.703163  0.721736  0.620971  0.798452  0.577381  1.000000   
DD    0.851112  0.759933  0.760574  0.675753  0.851070  0.582742  0.803719   
DIS   0.751486  0.742574  0.782171  0.660317  0.746241  0.554223  0.713484   
GE    0.765963  0.755788  0.752602  0.699929  0.724883  0.576092  0.741695   

            DD       DIS        GE  
AA    0.851112  0.751486  0.765963  
AXP   0.759933  0.742574  0.755788  
BA    0.760574  0.782171  0.752602  
BAC   0.675753  0.660317  0.699929  
CAT   0.851070  0.746241  0.724883  
CSCO  0.582742  0.554223  0.576092  
CVX   0.803719  0.713484  0.741695  
DD    1.000000  0.773421  0.768493  
DIS   0.773421  1.000000  0.768324  
GE    0.768493  0.768324  1.000000  
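
The `.ix` indexer was later deprecated and removed in pandas 1.0, so on current versions the same slice is written with `.iloc` (position-based) or `.loc` (label-based). A minimal sketch, using a hypothetical random frame in place of the real quotes:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the returns data: 12 symbols, 250 rows.
rng = np.random.default_rng(1)
symbols = ["AA", "AXP", "BA", "BAC", "CAT", "CSCO",
           "CVX", "DD", "DIS", "GE", "HD", "HPQ"]
returns = pd.DataFrame(rng.normal(size=(250, len(symbols))), columns=symbols)
corr = returns.corr()

# Position-based: first 10 rows and first 10 columns.
top_left = corr.iloc[:10, :10]

# Label-based: slices include both endpoints, so "AA":"GE" is 10 labels.
by_label = corr.loc["AA":"GE", "AA":"GE"]

print(top_left.round(3))
```

Both expressions select the same block here; `.iloc` is the closer match to the original `ix[0:10, 0:10]` call because it does not depend on the labels.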

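To show that this is not a case of pandas being unable to handle the amount of data, computing the correlation table takes only about a millisecond: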

In [3]: %timeit df.corr()
1000 loops, best of 3: 1.18 ms per loop
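
For the larger dataset mentioned in the question, a rough check can be run on synthetic data. The sketch below assumes about 252 trading days per year and 390 one-minute bars per day (figures chosen for illustration, not taken from the question), and uses a random walk in place of real prices. Under those assumptions, 10 years of 30 symbols is about 983,000 rows, or roughly 236 MB as a dense float64 table. The helper names `estimated_bytes` and `log_return_corr` are hypothetical.

```python
import time

import numpy as np
import pandas as pd

TRADING_DAYS = 252   # assumed trading days per year
BARS_PER_DAY = 390   # assumed one-minute bars per session
N_SYMBOLS = 30


def estimated_bytes(years, n_symbols=N_SYMBOLS):
    """Size in bytes of a dense float64 table of one-minute closes."""
    return years * TRADING_DAYS * BARS_PER_DAY * n_symbols * 8


def log_return_corr(prices):
    """Correlation matrix of log returns, computed without Python loops."""
    returns = np.log(prices).diff().dropna()
    return returns.corr()


# One synthetic year of one-minute prices (random walk in log space).
rng = np.random.default_rng(42)
n_rows = TRADING_DAYS * BARS_PER_DAY
steps = rng.normal(scale=1e-3, size=(n_rows, N_SYMBOLS))
prices = pd.DataFrame(100.0 * np.exp(np.cumsum(steps, axis=0)),
                      columns=["SYM%02d" % i for i in range(N_SYMBOLS)])

t0 = time.perf_counter()
corr = log_return_corr(prices)
elapsed = time.perf_counter() - t0

print("rows: %d, corr shape: %s, seconds: %.3f" % (n_rows, corr.shape, elapsed))
print("10 years would need about %.0f MB" % (estimated_bytes(10) / 1e6))
```

Timings depend on the machine, so none are quoted here. The point is only that the whole computation stays vectorized, and that a table of this size would fit in memory on a typical workstation under the stated assumptions.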