astropy.io.fits: reading rows from a large file with multiple HDUs

Date: 2016-03-02 22:44:57

Tags: python astropy fits

I have a ~50 GB FITS file containing multiple HDUs, which all have the same format: a (1E5 x 1E6) array holding 1E5 objects and 1E6 time stamps. The HDUs describe different physical properties such as flux, RA, DEC, etc. I want to read only 5 objects from each HDU (i.e. a (5 x 1E6) array).

python 2.7, astropy 1.0.3, linux x86_64

So far I have tried many of the suggestions I found, but none of them worked. My best approach is still:

import numpy as np
from astropy.io import fits

# the five objects I want to read out
obj_list = ['Star1','Star15','Star700','Star2000','Star5000']
dic = {}

# fname is the path to the ~50 GB FITS file
with fits.open(fname, memmap=True, do_not_scale_image_data=True) as hdulist:

    # There is a special HDU 'OBJECTS', a (1E5 x 1) array, which records
    # which index in the FITS file corresponds to which object.

    # First, get the indices of the rows that describe the objects in the
    # FITS file (not necessarily in order!)
    ind_objs = np.in1d(hdulist['OBJECTS'].data, obj_list, assume_unique=True).nonzero()[0] # indices of the candidates

    # Second, read out the 5 objects' time series
    dic['FLUX'] = hdulist['FLUX'].data[ind_objs] # (5 x 1E6) array
    dic['RA'] = hdulist['RA'].data[ind_objs] # (5 x 1E6) array
    dic['DEC'] = hdulist['DEC'].data[ind_objs] # (5 x 1E6) array

This code works fine for files up to ~20 GB, but runs out of memory for larger files (the larger files simply contain more objects, not more time stamps). I don't understand why: as far as I know, astropy.io.fits uses mmap internally and should only ever load the (5 x 1E6) arrays into memory. So what I want to read out always has the same size, independent of the file size.

Edit: this is the error message:

dic['RA'] = hdulist['RA'].data[ind_objs] # (5 x 1E6) array

  File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/utils/decorators.py", line 341, in __get__
    val = self._fget(obj)
  File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/hdu/image.py", line 239, in data
    data = self._get_scaled_image_data(self._data_offset, self.shape)
  File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/hdu/image.py", line 585, in _get_scaled_image_data
    raw_data = self._get_raw_data(shape, code, offset)
  File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/hdu/base.py", line 523, in _get_raw_data
    return self._file.readarray(offset=offset, dtype=code, shape=shape)
  File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/file.py", line 248, in readarray
    shape=shape).view(np.ndarray)
  File "/usr/local/python/lib/python2.7/site-packages/numpy/core/memmap.py", line 254, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
mmap.error: [Errno 12] Cannot allocate memory

Edit 2: Thanks, I have now incorporated the suggestions, which allows me to handle files up to 50 GB. The new code:

import numpy as np
from astropy.io import fits

# the five objects I want to read out
obj_list = ['Star1','Star15','Star700','Star2000','Star5000']
dic = {}

# fname is the path to the ~50 GB FITS file
with fits.open(fname, mode='denywrite', memmap=True, do_not_scale_image_data=True) as hdulist:

    # There is a special HDU 'OBJECTS', a (1E5 x 1) array, which records
    # which index in the FITS file corresponds to which object.

    # First, get the indices of the rows that describe the objects in the
    # FITS file (not necessarily in order!)
    ind_objs = np.in1d(hdulist['OBJECTS'].data, obj_list, assume_unique=True).nonzero()[0] # indices of the candidates

    # Second, read out the 5 objects' time series, freeing each HDU's mmap
    # as soon as its rows have been copied out
    dic['FLUX'] = hdulist['FLUX'].data[ind_objs] # (5 x 1E6) array
    del hdulist['FLUX'].data
    dic['RA'] = hdulist['RA'].data[ind_objs] # (5 x 1E6) array
    del hdulist['RA'].data
    dic['DEC'] = hdulist['DEC'].data[ind_objs] # (5 x 1E6) array
    del hdulist['DEC'].data

mode='denywrite'

did not change anything.

memmap=True

is indeed not the default and has to be set manually.

del hdulist['FLUX'].data

etc. now lets me read 50 GB files instead of 20 GB files (the same pattern is shown as a loop below).
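
For reference, the three read-then-delete pairs can be collapsed into one loop; this is just a restatement of the new code above, with the same fname and obj_list:

import numpy as np
from astropy.io import fits

dic = {}
with fits.open(fname, mode='denywrite', memmap=True, do_not_scale_image_data=True) as hdulist:
    ind_objs = np.in1d(hdulist['OBJECTS'].data, obj_list, assume_unique=True).nonzero()[0]
    for name in ('FLUX', 'RA', 'DEC'):
        dic[name] = hdulist[name].data[ind_objs]  # copy the 5 rows into memory
        del hdulist[name].data                    # release this HDU's mmap immediately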

New problem: anything larger than 50 GB still causes the same memory error, but now already on the first line:

dic['FLUX'] = hdulist['FLUX'].data[ind_objs] # (5 x 1E6) array

1 Answer:

Answer 0 (score: 4)

It looks like you have run into this issue: https://groups.google.com/forum/#!topic/orient-database/f7bd3s4f3Jo

The problem here is that even though it is using mmap, it opens the mmap in copy-on-write mode. That means your system needs to be able to allocate a virtual memory region as large as the whole mapping, because if you were to write data back through the mmap, it could in principle hold that much data.
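
The difference between the two access modes can be illustrated with plain numpy memmaps; this is only an analogue of what astropy does internally, with a made-up file name and shape:

import numpy as np

# Create a small binary file to stand in for the FITS data segment.
np.arange(12, dtype=np.float32).tofile('data.bin')

# mode='c' is copy-on-write, like astropy's default: every page may be
# modified in memory, so the OS has to reserve address space (and, under
# strict overcommit settings, backing memory) for the whole mapping.
cow = np.memmap('data.bin', dtype=np.float32, mode='c', shape=(3, 4))
cow[0, 0] = 99.0   # allowed; the change never reaches the file

# mode='r' is read-only, like mode='denywrite': pages stay backed by the
# file itself, so even a huge file costs essentially no memory to map.
ro = np.memmap('data.bin', dtype=np.float32, mode='r', shape=(3, 4))
# ro[0, 0] = 99.0  # would raise ValueError: assignment destination is read-only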

If you pass mode='denywrite' to fits.open(), it should work. Any attempt to modify the array will then raise an error, but that is fine if you only want to read the data.

If that still does not get you there, you can also try the module linked from https://github.com/astropy/astropy/issues/1380, which has better support for reading the file in smaller chunks.
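
Another option within astropy itself, if these are image HDUs: the section interface reads rectangular subsets directly from disk without mapping the whole array. A minimal sketch, assuming the same HDU layout as in the question (fname is a placeholder path):

import numpy as np
from astropy.io import fits

fname = 'my_big_file.fits'  # placeholder: path to the large FITS file
obj_list = ['Star1','Star15','Star700','Star2000','Star5000']

dic = {}
with fits.open(fname) as hdulist:
    ind_objs = np.in1d(hdulist['OBJECTS'].data, obj_list, assume_unique=True).nonzero()[0]
    for name in ('FLUX', 'RA', 'DEC'):
        # .section reads only the requested rows from disk, one at a time,
        # without ever touching the full (1E5 x 1E6) array.
        dic[name] = np.vstack([hdulist[name].section[int(i), :] for i in ind_objs])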