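I have the following code, which I downloaded from the web. I have modified it by breaking the download into parts so that it is easier on Yahoo.
My question is about the ability of Python (in general) and Python + pandas to handle much more data than I am attempting here. When I run this code and compute all the correlations between the symbols, it ends up choking (see the "Instead, it displays this" part below). If I remove some of the calculations, it seems fine. I am not sure what is choking; I assume it is pandas?
What is the correct way to break up this code so that it does not lose its conciseness [vectorization rather than loops] and can still handle much more data? I want to be able to process 10 years of 1-minute data stored in files, and if it cannot even handle one year of daily data, it will not work on that data set.
So my question is:
What is the correct way to fix this program (hopefully in a way I can generalize) so that it can run on the DOW 30 symbols?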
import pandas
from matplotlib.pyplot import show, legend
from datetime import datetime
from matplotlib import finance
import numpy
# 2011 to 2012
start = datetime(2011, 01, 01)
end = datetime(2012, 01, 01)
symbolsAK = ["AA", "AXP", "BA", "BAC", "CAT",
             "CSCO", "CVX", "DD", "DIS", "GE", "HD",
             "HPQ", "IBM", "INTC", "JNJ", "JPM",
             "KO"]
symbolsMP = ["MCD", "MMM", "MRK", "MSFT", "PFE", "PG"]
#symbolsTX = ["T", "TRV", "UNH", "UTX", "VZ", "WMT", "XOM"]
symbols = symbolsAK
symbols = symbols + symbolsMP
#symbols = symbols + symbolsTX
quotesAK = [finance.quotes_historical_yahoo(symbol, start, end, asobject=True)
            for symbol in symbolsAK]
quotesMP = [finance.quotes_historical_yahoo(symbol, start, end, asobject=True)
            for symbol in symbolsMP]
#quotesTX = [finance.quotes_historical_yahoo(symbol, start, end, asobject=True)
#            for symbol in symbolsTX]
quotes = quotesAK
quotes = quotes + quotesMP
#quotes = quotes + quotesTX
close = numpy.array([q.close for q in quotes]).astype(numpy.float)
dates = numpy.array([q.date for q in quotes])
data = {}
for i in xrange(len(symbols)):
    data[symbols[i]] = numpy.diff(numpy.log(close[i]))
df = pandas.DataFrame(data, index=dates[0][:-1], columns=symbols)
print df.corr()
It should look like (some of) this output [shortened]:
#           AA       AXP        BA       BAC       CAT
#AA   1.000000  0.768484  0.758264  0.737625  0.837643
#AXP  0.768484  1.000000  0.746898  0.760043  0.736337
#BA   0.758264  0.746898  1.000000  0.657075  0.770696
#BAC  0.737625  0.760043  0.657075  1.000000  0.657113
#CAT  0.837643  0.736337  0.770696  0.657113  1.000000
Instead, it displays this:
<class 'pandas.core.frame.DataFrame'>
Index: 23 entries, AA to PG
Data columns (total 23 columns):
AA 23 non-null values
AXP 23 non-null values
BA 23 non-null values
BAC 23 non-null values
CAT 23 non-null values
CSCO 23 non-null values
CVX 23 non-null values
DD 23 non-null values
DIS 23 non-null values
GE 23 non-null values
HD 23 non-null values
HPQ 23 non-null values
IBM 23 non-null values
INTC 23 non-null values
JNJ 23 non-null values
JPM 23 non-null values
KO 23 non-null values
MCD 23 non-null values
MMM 23 non-null values
MRK 23 non-null values
MSFT 23 non-null values
PFE 23 non-null values
PG 23 non-null values
dtypes: float64(23)
Answer 0 (score: 5)
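I think this is not a memory or speed issue, but just a matter of how pandas formats its console output (see http://pandas.pydata.org/pandas-docs/stable/basics.html#working-with-package-options).
If a DataFrame is too large to display in the console, pandas falls back to a summary view that lists how many non-null values each column has. On my computer, the limit is 20 columns and 60 rows, but you can change this setting to display larger DataFrames.
This is the summary view you get: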
In [2]: df.corr()
Out[2]:
<class 'pandas.core.frame.DataFrame'>
Index: 30 entries, AA to XOM
Data columns (total 30 columns):
AA 30 non-null values
AXP 30 non-null values
BA 30 non-null values
BAC 30 non-null values
CAT 30 non-null values
CSCO 30 non-null values
CVX 30 non-null values
DD 30 non-null values
DIS 30 non-null values
GE 30 non-null values
HD 30 non-null values
HPQ 30 non-null values
IBM 30 non-null values
INTC 30 non-null values
JNJ 30 non-null values
JPM 30 non-null values
KO 30 non-null values
MCD 30 non-null values
MMM 30 non-null values
MRK 30 non-null values
MSFT 30 non-null values
PFE 30 non-null values
PG 30 non-null values
T 30 non-null values
TRV 30 non-null values
UNH 30 non-null values
UTX 30 non-null values
VZ 30 non-null values
WMT 30 non-null values
XOM 30 non-null values
dtypes: float64(30)
You can change the number of columns to display:
In [5]: pandas.options.display.max_columns = 50
Now the entire DataFrame is displayed:
In [6]: df.corr()
Out[6]:
AA AXP BA BAC CAT CSCO CVX \
AA 1.000000 0.768692 0.758487 0.738168 0.838511 0.584911 0.785955
AXP 0.768692 1.000000 0.746401 0.760255 0.736557 0.553068 0.703163
BA 0.758487 0.746401 1.000000 0.657093 0.770767 0.540786 0.721736
BAC 0.738168 0.760255 0.657093 1.000000 0.657254 0.518776 0.620971
CAT 0.838511 0.736557 0.770767 0.657254 1.000000 0.572002 0.798452
CSCO 0.584911 0.553068 0.540786 0.518776 0.572002 1.000000 0.577381
CVX 0.785955 0.703163 0.721736 0.620971 0.798452 0.577381 1.000000
DD 0.851112 0.759933 0.760574 0.675753 0.851070 0.582742 0.803719
DIS 0.751486 0.742574 0.782171 0.660317 0.746241 0.554223 0.713484
GE 0.765963 0.755788 0.752602 0.699929 0.724883 0.576092 0.741695
HD 0.614588 0.647022 0.661556 0.575308 0.621294 0.499147 0.647491
HPQ 0.595110 0.509375 0.573672 0.453443 0.594590 0.427063 0.487018
IBM 0.662302 0.635524 0.701338 0.501511 0.664953 0.525319 0.625367
INTC 0.610633 0.587957 0.633674 0.463658 0.634462 0.543521 0.580799
JNJ 0.676678 0.674556 0.668016 0.569427 0.677017 0.562765 0.707574
JPM 0.799386 0.803539 0.695899 0.843875 0.726496 0.583126 0.701845
KO 0.632974 0.649398 0.686937 0.504384 0.621104 0.496410 0.684645
MCD 0.590209 0.619359 0.608288 0.482579 0.557322 0.467640 0.584303
MMM 0.807796 0.760495 0.760528 0.674455 0.805890 0.598801 0.771571
MRK 0.683526 0.675286 0.683141 0.594330 0.630897 0.528784 0.669215
MSFT 0.708997 0.670527 0.675077 0.579668 0.672689 0.629810 0.676658
PFE 0.692207 0.661628 0.661427 0.580332 0.653362 0.524557 0.695621
PG 0.533456 0.638269 0.634056 0.461198 0.569150 0.500971 0.628252
T 0.662901 0.658365 0.647409 0.585394 0.611656 0.454966 0.659306
TRV 0.697297 0.690480 0.693580 0.692810 0.679448 0.550598 0.707205
UNH 0.626418 0.645823 0.644480 0.577014 0.642021 0.502656 0.628023
UTX 0.800084 0.770001 0.818340 0.650226 0.844137 0.611440 0.779919
VZ 0.613422 0.613442 0.576083 0.536747 0.589583 0.472622 0.627931
WMT 0.517511 0.575717 0.587670 0.479790 0.538195 0.515317 0.556602
XOM 0.747023 0.699433 0.734805 0.598516 0.753005 0.581742 0.905136
DD DIS GE HD HPQ IBM INTC \
AA 0.851112 0.751486 0.765963 0.614588 0.595110 0.662302 0.610633
AXP 0.759933 0.742574 0.755788 0.647022 0.509375 0.635524 0.587957
BA 0.760574 0.782171 0.752602 0.661556 0.573672 0.701338 0.633674
BAC 0.675753 0.660317 0.699929 0.575308 0.453443 0.501511 0.463658
CAT 0.851070 0.746241 0.724883 0.621294 0.594590 0.664953 0.634462
CSCO 0.582742 0.554223 0.576092 0.499147 0.427063 0.525319 0.543521
CVX 0.803719 0.713484 0.741695 0.647491 0.487018 0.625367 0.580799
DD 1.000000 0.773421 0.768493 0.660224 0.587773 0.674010 0.627005
DIS 0.773421 1.000000 0.768324 0.643008 0.609767 0.678413 0.607358
GE 0.768493 0.768324 1.000000 0.649000 0.553156 0.656494 0.625745
HD 0.660224 0.643008 0.649000 1.000000 0.459635 0.575951 0.572010
HPQ 0.587773 0.609767 0.553156 0.459635 1.000000 0.582698 0.548928
IBM 0.674010 0.678413 0.656494 0.575951 0.582698 1.000000 0.633732
INTC 0.627005 0.607358 0.625745 0.572010 0.548928 0.633732 1.000000
JNJ 0.714763 0.654975 0.683914 0.589519 0.494923 0.602186 0.571545
JPM 0.767345 0.737792 0.795344 0.601889 0.521005 0.602322 0.569887
KO 0.696257 0.656332 0.674888 0.631668 0.443318 0.694586 0.574671
MCD 0.583090 0.569733 0.556076 0.608105 0.337828 0.569540 0.491635
MMM 0.799806 0.775277 0.797455 0.654009 0.578911 0.676061 0.650945
MRK 0.671173 0.690316 0.687744 0.574417 0.448651 0.627232 0.547941
MSFT 0.703819 0.684609 0.679975 0.631967 0.521019 0.682591 0.662063
PFE 0.690313 0.650876 0.706027 0.638876 0.474586 0.623725 0.550615
PG 0.617922 0.611371 0.613507 0.556490 0.431871 0.610044 0.551303
T 0.686551 0.669819 0.680358 0.597554 0.494590 0.678023 0.545211
TRV 0.710612 0.710623 0.677900 0.624701 0.482071 0.589566 0.608157
UNH 0.640953 0.651940 0.632988 0.612200 0.407039 0.611192 0.547778
UTX 0.815454 0.786531 0.777018 0.673500 0.610108 0.748190 0.692028
VZ 0.630868 0.617529 0.684984 0.567786 0.424424 0.586035 0.508896
WMT 0.566875 0.581024 0.556110 0.692174 0.374181 0.489173 0.489745
XOM 0.774908 0.720534 0.761815 0.639149 0.523942 0.675966 0.610824
JNJ JPM KO MCD MMM MRK MSFT \
AA 0.676678 0.799386 0.632974 0.590209 0.807796 0.683526 0.708997
AXP 0.674556 0.803539 0.649398 0.619359 0.760495 0.675286 0.670527
BA 0.668016 0.695899 0.686937 0.608288 0.760528 0.683141 0.675077
BAC 0.569427 0.843875 0.504384 0.482579 0.674455 0.594330 0.579668
CAT 0.677017 0.726496 0.621104 0.557322 0.805890 0.630897 0.672689
CSCO 0.562765 0.583126 0.496410 0.467640 0.598801 0.528784 0.629810
CVX 0.707574 0.701845 0.684645 0.584303 0.771571 0.669215 0.676658
DD 0.714763 0.767345 0.696257 0.583090 0.799806 0.671173 0.703819
DIS 0.654975 0.737792 0.656332 0.569733 0.775277 0.690316 0.684609
GE 0.683914 0.795344 0.674888 0.556076 0.797455 0.687744 0.679975
HD 0.589519 0.601889 0.631668 0.608105 0.654009 0.574417 0.631967
HPQ 0.494923 0.521005 0.443318 0.337828 0.578911 0.448651 0.521019
IBM 0.602186 0.602322 0.694586 0.569540 0.676061 0.627232 0.682591
INTC 0.571545 0.569887 0.574671 0.491635 0.650945 0.547941 0.662063
JNJ 1.000000 0.649433 0.661615 0.591725 0.736881 0.720435 0.606554
JPM 0.649433 1.000000 0.584480 0.520379 0.764575 0.632774 0.665440
KO 0.661615 0.584480 1.000000 0.659553 0.684177 0.685925 0.630570
MCD 0.591725 0.520379 0.659553 1.000000 0.639054 0.610580 0.569149
MMM 0.736881 0.764575 0.684177 0.639054 1.000000 0.688326 0.705497
MRK 0.720435 0.632774 0.685925 0.610580 0.688326 1.000000 0.620179
MSFT 0.606554 0.665440 0.630570 0.569149 0.705497 0.620179 1.000000
PFE 0.710511 0.627674 0.630108 0.599965 0.687126 0.723702 0.620668
PG 0.664540 0.593982 0.660393 0.566643 0.655894 0.646314 0.579561
T 0.619650 0.661625 0.637338 0.555407 0.645148 0.642262 0.608858
TRV 0.625928 0.728347 0.675313 0.598593 0.739503 0.654874 0.600154
UNH 0.620315 0.593633 0.618663 0.534163 0.610730 0.611829 0.562731
UTX 0.725406 0.718998 0.710645 0.624908 0.848424 0.694618 0.723456
VZ 0.634423 0.606947 0.592759 0.522129 0.635813 0.620811 0.564451
WMT 0.574580 0.552472 0.568968 0.571420 0.610972 0.571786 0.579684
XOM 0.724311 0.712734 0.710473 0.567184 0.748141 0.699390 0.703494
PFE PG T TRV UNH UTX VZ \
AA 0.692207 0.533456 0.662901 0.697297 0.626418 0.800084 0.613422
AXP 0.661628 0.638269 0.658365 0.690480 0.645823 0.770001 0.613442
BA 0.661427 0.634056 0.647409 0.693580 0.644480 0.818340 0.576083
BAC 0.580332 0.461198 0.585394 0.692810 0.577014 0.650226 0.536747
CAT 0.653362 0.569150 0.611656 0.679448 0.642021 0.844137 0.589583
CSCO 0.524557 0.500971 0.454966 0.550598 0.502656 0.611440 0.472622
CVX 0.695621 0.628252 0.659306 0.707205 0.628023 0.779919 0.627931
DD 0.690313 0.617922 0.686551 0.710612 0.640953 0.815454 0.630868
DIS 0.650876 0.611371 0.669819 0.710623 0.651940 0.786531 0.617529
GE 0.706027 0.613507 0.680358 0.677900 0.632988 0.777018 0.684984
HD 0.638876 0.556490 0.597554 0.624701 0.612200 0.673500 0.567786
HPQ 0.474586 0.431871 0.494590 0.482071 0.407039 0.610108 0.424424
IBM 0.623725 0.610044 0.678023 0.589566 0.611192 0.748190 0.586035
INTC 0.550615 0.551303 0.545211 0.608157 0.547778 0.692028 0.508896
JNJ 0.710511 0.664540 0.619650 0.625928 0.620315 0.725406 0.634423
JPM 0.627674 0.593982 0.661625 0.728347 0.593633 0.718998 0.606947
KO 0.630108 0.660393 0.637338 0.675313 0.618663 0.710645 0.592759
MCD 0.599965 0.566643 0.555407 0.598593 0.534163 0.624908 0.522129
MMM 0.687126 0.655894 0.645148 0.739503 0.610730 0.848424 0.635813
MRK 0.723702 0.646314 0.642262 0.654874 0.611829 0.694618 0.620811
MSFT 0.620668 0.579561 0.608858 0.600154 0.562731 0.723456 0.564451
PFE 1.000000 0.576964 0.597129 0.642421 0.590014 0.675389 0.628915
PG 0.576964 1.000000 0.668227 0.607292 0.492360 0.677481 0.591762
T 0.597129 0.668227 1.000000 0.657551 0.604891 0.648988 0.756705
TRV 0.642421 0.607292 0.657551 1.000000 0.665523 0.683029 0.587940
UNH 0.590014 0.492360 0.604891 0.665523 1.000000 0.660746 0.486421
UTX 0.675389 0.677481 0.648988 0.683029 0.660746 1.000000 0.605494
VZ 0.628915 0.591762 0.756705 0.587940 0.486421 0.605494 1.000000
WMT 0.552283 0.618861 0.529654 0.619793 0.499349 0.601957 0.549769
XOM 0.715801 0.666614 0.692532 0.706332 0.654499 0.776531 0.617919
WMT XOM
AA 0.517511 0.747023
AXP 0.575717 0.699433
BA 0.587670 0.734805
BAC 0.479790 0.598516
CAT 0.538195 0.753005
CSCO 0.515317 0.581742
CVX 0.556602 0.905136
DD 0.566875 0.774908
DIS 0.581024 0.720534
GE 0.556110 0.761815
HD 0.692174 0.639149
HPQ 0.374181 0.523942
IBM 0.489173 0.675966
INTC 0.489745 0.610824
JNJ 0.574580 0.724311
JPM 0.552472 0.712734
KO 0.568968 0.710473
MCD 0.571420 0.567184
MMM 0.610972 0.748141
MRK 0.571786 0.699390
MSFT 0.579684 0.703494
PFE 0.552283 0.715801
PG 0.618861 0.666614
T 0.529654 0.692532
TRV 0.619793 0.706332
UNH 0.499349 0.654499
UTX 0.601957 0.776531
VZ 0.549769 0.617919
WMT 1.000000 0.550944
XOM 0.550944 1.000000
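Setting `pandas.options.display.max_columns` as above changes the limit for the whole session. If you only want the wide output once, a context manager may be more convenient. The following is a minimal sketch, assuming a recent pandas and NumPy under Python 3; the random data and the `S00`...`S29` column names are made up for illustration and stand in for the real quotes.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the returns of 30 symbols (random data, not real quotes).
rng = np.random.default_rng(0)
columns = ["S%02d" % i for i in range(30)]
returns = pd.DataFrame(rng.normal(size=(250, 30)), columns=columns)
corr = returns.corr()

# Remember the global setting so we can check that it is left alone.
before = pd.get_option("display.max_columns")

# Raise the column limit only inside this block.
with pd.option_context("display.max_columns", 50):
    rendered = repr(corr)  # same text the console would show
    print(rendered)

# Outside the block the previous limit applies again.
after = pd.get_option("display.max_columns")
```

Once the `with` block exits, `display.max_columns` is back to its earlier value, so other output in the session is not affected.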
Alternatively, you can select just a part of the data to look at. In the code below, I use ix:
In [7]: df.corr().ix[0:10,0:10]
Out[7]:
AA AXP BA BAC CAT CSCO CVX \
AA 1.000000 0.768692 0.758487 0.738168 0.838511 0.584911 0.785955
AXP 0.768692 1.000000 0.746401 0.760255 0.736557 0.553068 0.703163
BA 0.758487 0.746401 1.000000 0.657093 0.770767 0.540786 0.721736
BAC 0.738168 0.760255 0.657093 1.000000 0.657254 0.518776 0.620971
CAT 0.838511 0.736557 0.770767 0.657254 1.000000 0.572002 0.798452
CSCO 0.584911 0.553068 0.540786 0.518776 0.572002 1.000000 0.577381
CVX 0.785955 0.703163 0.721736 0.620971 0.798452 0.577381 1.000000
DD 0.851112 0.759933 0.760574 0.675753 0.851070 0.582742 0.803719
DIS 0.751486 0.742574 0.782171 0.660317 0.746241 0.554223 0.713484
GE 0.765963 0.755788 0.752602 0.699929 0.724883 0.576092 0.741695
DD DIS GE
AA 0.851112 0.751486 0.765963
AXP 0.759933 0.742574 0.755788
BA 0.760574 0.782171 0.752602
BAC 0.675753 0.660317 0.699929
CAT 0.851070 0.746241 0.724883
CSCO 0.582742 0.554223 0.576092
CVX 0.803719 0.713484 0.741695
DD 1.000000 0.773421 0.768493
DIS 0.773421 1.000000 0.768324
GE 0.768493 0.768324 1.000000
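Note that `.ix` was deprecated in pandas 0.20 and removed in pandas 1.0, so on a newer installation the same selection would be written with `.iloc` (by position) or `.loc` (by label). A small sketch, assuming a recent pandas under Python 3 and using random numbers in place of the real returns:

```python
import numpy as np
import pandas as pd

# Random data standing in for the real log returns (illustration only).
rng = np.random.default_rng(1)
symbols = ["AA", "AXP", "BA", "BAC", "CAT", "CSCO",
           "CVX", "DD", "DIS", "GE", "HD", "HPQ"]
returns = pd.DataFrame(rng.normal(size=(250, len(symbols))), columns=symbols)
corr = returns.corr()

# Position-based: first 10 rows and first 10 columns (end is exclusive).
top_left = corr.iloc[0:10, 0:10]

# Label-based: with .loc both endpoints of the slice are inclusive.
by_label = corr.loc["AA":"GE", "AA":"GE"]

print(top_left.shape)
```

Both selections return the same 10 x 10 block here, because "GE" is the tenth symbol in the list.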
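To show that this is not a matter of pandas being unable to handle the amount of data, computing the correlation table takes only about a millisecond: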
In [3]: %timeit df.corr()
1000 loops, best of 3: 1.18 ms per loop
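To get a feel for how this scales toward the minute-level data mentioned in the question, you could time the same computation on synthetic prices. This is only a sketch under stated assumptions: the helpers `synthetic_prices` and `log_return_corr` are hypothetical names introduced here, the prices are a random walk rather than real quotes, and the row count is an arbitrary choice (ten years of 1-minute bars would be on the order of a million rows, depending on trading hours). It also shows the log returns computed for all columns at once, without the per-symbol loop.

```python
import time

import numpy as np
import pandas as pd


def synthetic_prices(n_rows, n_symbols, seed=0):
    """Random-walk prices; a stand-in for real quotes (hypothetical data)."""
    rng = np.random.default_rng(seed)
    steps = rng.normal(loc=0.0, scale=0.001, size=(n_rows, n_symbols))
    columns = ["SYM%02d" % i for i in range(n_symbols)]
    return pd.DataFrame(100.0 * np.exp(np.cumsum(steps, axis=0)), columns=columns)


def log_return_corr(prices):
    """Log returns for every column at once, then their correlation matrix."""
    log_returns = np.log(prices).diff().dropna()  # first row is NaN after diff
    return log_returns.corr()


prices = synthetic_prices(n_rows=100_000, n_symbols=30)

started = time.perf_counter()
corr = log_return_corr(prices)
elapsed = time.perf_counter() - started

print("corr shape: %s, computed in %.3f s" % (corr.shape, elapsed))
```

Raising `n_rows` lets you check on your own machine whether the correlation step is actually a bottleneck before restructuring the code.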