Correct way to get mean/describe values of a large dataset in pandas

Asked: 2014-09-25 17:29:28

Tags: pandas hdfstore

I get "array is too big" from read_hdf, which presumably means I have to iterate over the file and compute the results in chunks before combining them. I'm wondering whether there is an automatic way to do this, or perhaps a better approach I'm not aware of?

Any suggestions would be much appreciated!
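Something along these lines is what I mean by doing it in chunks (a minimal sketch only: 'my_column' is just a placeholder for one of the columns in get_columns, and the chunksize is arbitrary). It handles count, mean, variance and standard deviation from per-chunk sums, but not the quantiles, skew or kurtosis, which is why I am hoping there is an automatic or better way:

import numpy as np
import pandas as pd

# Accumulate simple sufficient statistics chunk by chunk so the whole
# column never has to fit in memory at once ('my_column' is a placeholder).
count = 0
total = 0.0
total_sq = 0.0
store = pd.HDFStore(self.file, mode='r')
try:
    for chunk in store.select(self.key, columns=['my_column'], chunksize=500000):
        col = chunk['my_column'].dropna()
        count += len(col)
        total += col.sum()
        total_sq += (col ** 2).sum()
finally:
    store.close()

mean = total / count
var = (total_sq - count * mean ** 2) / (count - 1)   # unbiased (ddof=1)
std = np.sqrt(var)

The chunksize would obviously need tuning so that each chunk fits comfortably in a 32-bit process.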

Right now I load the file with:

res = pd.read_hdf(self.file, self.key, columns=get_columns)

Then I compute the summary statistics:

describe = res.describe()
text = ''
count = int(describe['count'])
text += 'Count: %s\n' % str(count)
text += 'Mean: %s\n' % str(describe['mean'])
text += 'Standard Deviation: %s\n' % str(describe['std'])
text += 'Range: [%s, %s]\n' % (str(int(describe['min'])), str(int(describe['max'])))
text += "25%%: %s\n" % str(int(describe['25%']))
text += "50%% (median): %s\n" % str(int(describe['50%']))
text += "75%%: %s\n" % str(int(describe['75%']))
text += "Unbiased Kurtosis: %s\n" % str(res.kurt())
text += "Unbiased Skew: %s\n" % str(res.skew())
text += "Unbiased Variance: %s\n" % str(res.var())

Running this against the HDF file (812 MB, blosc-compressed) produces:

res= pd.read_hdf(self.file, self.key, columns = get_columns)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-i686.egg/pandas/io/pytables.py", line 330, in read_hdf
    return f(store, True)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-i686.egg/pandas/io/pytables.py", line 322, in <lambda>
    key, auto_close=auto_close, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-i686.egg/pandas/io/pytables.py", line 669, in select
    auto_close=auto_close).get_values()
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-i686.egg/pandas/io/pytables.py", line 1335, in get_values
    results = self.func(self.start, self.stop)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-i686.egg/pandas/io/pytables.py", line 658, in func
    columns=columns, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-i686.egg/pandas/io/pytables.py", line 3822, in read
    if not self.read_axes(where=where, **kwargs):
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-i686.egg/pandas/io/pytables.py", line 3056, in read_axes
    values = self.selection.select()
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-i686.egg/pandas/io/pytables.py", line 4339, in select
    return self.table.table.read(start=self.start, stop=self.stop)
  File "/usr/lib/python2.7/dist-packages/tables/table.py", line 1975, in read
    arr = self._read(start, stop, step, field, out)
  File "/usr/lib/python2.7/dist-packages/tables/table.py", line 1865, in _read
    result = self._get_container(nrows)
  File "/usr/lib/python2.7/dist-packages/tables/table.py", line 958, in _get_container
    return numpy.empty(shape=shape, dtype=self._v_dtype)
ValueError: array is too big.

pd.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 32
OS: Linux
OS-release: 3.13.0-24-generic
machine: i686
processor: i686
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.1
nose: 1.3.1
Cython: 0.20.1post0
numpy: 1.8.1
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 1.2.1
sphinx: 1.2.2
patsy: 0.2.1
scikits.timeseries: None
dateutil: 1.5
pytz: 2012c
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.3.3
bs4: 4.2.1
html5lib: 0.999
httplib2: 0.8
apiclient: None
rpy2: None
sqlalchemy: 0.8.4
pymysql: None
psycopg2: None

ptdump: Here

0 Answers:

No answers.