Question

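First of all, I have already read the following:

There were also more links inside the first link, but none of them worked...

My problem is opening huge (> 80 MB each) and numerous (~3000) FITS files in a Jupyter Notebook. The relevant code snippet is as follows: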

# Dictionary to store NxN data matrices of cropped image tiles
CroppedObjects = {}

# Definitions of the other variables used below
# ...

# Iterate over all images ('j') that contain the current object, indexed by 'i'
for i in range(0, len(finalObjects)):
    for j in range(0, len(containingImages[containedObj[i]])):

        countImages += 1

        # Path to the current image: 'mnt/...'
        current_image_path = ImagePaths[int(containingImages[containedObj[i]][j])]

        # Open .fits images
        with fits.open(current_image_path, memmap=False) as hdul:
            # Collect image data
            image_data = fits.getdata(current_image_path)

            # Collect WCS data from the current .fits's header
            ImageWCS = wcs.WCS(hdul[1].header)

            # Cropping parameters:
            # 1. Sky-coordinates of the croppable object
            # 2. Size of the crop, already defined above
            Coordinates = coordinates.SkyCoord(finalObjects[i][1]*u.deg,finalObjects[i][2]*u.deg, frame='fk5')
            size = (cropSize*u.pixel, cropSize*u.pixel)

            try:
                # Cut out the image tile
                cutout = Cutout2D(image_data, position=Coordinates, size=size, wcs=ImageWCS, mode='strict')

                # Build the output filename, used here as the dictionary key
                cutout_filename = "Cropped_Images_Sorted/Cropped_" + str(containedObj[i]) + current_image_path[-23:]

                # Save data to dictionary
                CroppedObjects[cutout_filename] = cutout.data

                foundImages += 1

            except:
                pass

            else:
                del image_data
                continue

        # Memory maintenance
        gc.collect()

        # Progress bar
        sys.stdout.write("\rProgress: [{0}{1}] {2:.3f}%\tElapsed: {3}\tRemaining: {4}  {5}".format(u'\u2588' * int(countImages/allCrops * progressbar_width),
                                                                                                   u'\u2591' * (progressbar_width - int(countImages/allCrops * progressbar_width)),
                                                                                                   countImages/allCrops * 100,
                                                                                                   datetime.now()-starttime,
                                                                                                   (datetime.now()-starttime)/countImages * (allCrops - countImages),
                                                                                                   foundImages))

        sys.stdout.flush()

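OK, so it actually does three things:

Opens a specific FITS file
Cuts out a square from it (in strict mode, so if the arrays only partially overlap, the try statement skips to the next step of the loop)
Updates the progress bar

Then it moves on to the next file, does the same thing, and so iterates over all of my FITS files.

BUT: if I try to run this code, it stops after about 1000 found images and gives OSError: [Errno 24] Too many open files on the line: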

image_data = fits.getdata(current_image_path)

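I have tried everything that was supposed to solve the problem, but nothing helped... not even setting memory mapping to false, or using fits.getdata and gc.collect()... I also tried many small changes, such as running without the try statement and cutting out all image tiles without any restriction. The del inside the else statement is another desperate attempt of mine. What else can I try to finally make this work?
Also, please feel free to ask if anything is unclear! I will do my best to help you understand the problem!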

Answer 1

I had a similar problem in the past (see here). In the end I got it working roughly like this:

total = 0
for filename in filenames:
    with fits.open(filename, memmap=False) as hdulist:
        data = hdulist['spam'].data
    total += data.sum()

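Some notes:

Use fits.open with memmap=False
Use it in a with block, so that the file is closed reliably
Keep the with block short: load only the data you need into memory, then leave the block so the file is closed
Do whatever processing you need on the data after the file is closed; this may not be necessary, but if the problem is that Python references to data in the file prevent it from being closed, it simplifies the situation. I don't think the cutout code in your example is the problem, but it might be. Have you tried commenting it out?
Don't make the extra fits.getdata call; I think it opens the file a second time
No del or gc.collect is needed: if the code is as simple as suggested here, there are no circular references and Python reliably deletes objects at the end of their scope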

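Now, this may not help and you may still run into the problem. In that case, the way forward is to create a minimal reproducible example that fails for you (as I did here) and then file an issue with Astropy, giving your Python version, Astropy version, and operating system, or post it here. The point is: this is complex and probably runtime/version dependent, so to pin it down you need an example that anyone can run but that fails for you.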
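While narrowing the problem down to such a minimal example, it can help to watch how many file descriptors the process actually holds. The following is only a diagnostic sketch, not part of Astropy: the helper name open_fd_count is hypothetical, and it assumes a Linux or macOS system where /proc/self/fd or /dev/fd can be listed. On many Linux systems the soft limit is 1024, which would be consistent with a failure after roughly 1000 images if one descriptor leaks per file.

```python
import os

try:
    import resource  # POSIX only; not available on Windows
except ImportError:
    resource = None


def open_fd_count():
    """Count this process's open file descriptors (Linux or macOS)."""
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        if os.path.isdir(fd_dir):
            return len(os.listdir(fd_dir))
    raise OSError("no file-descriptor directory found on this platform")


if resource is not None:
    # The soft limit is the one that triggers "Too many open files".
    soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("descriptor limit (soft, hard):", soft_limit, hard_limit)

before = open_fd_count()
handles = [open(os.devnull) for _ in range(5)]  # simulate files left open
during = open_fd_count()
for handle in handles:
    handle.close()
after = open_fd_count()

print("open descriptors before/during/after:", before, during, after)
```

Calling open_fd_count() once per loop iteration in the real script should show whether the count grows with every processed image.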

Answer 2

This line is what's hurting you:

image_data = fits.getdata(current_image_path)

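You just opened the file on the previous line with memmap=False, but on this line you are reopening it with memmap=True and maintaining a reference to the data in image_data by wrapping it in Cutout2D, and then keeping a reference to the data with:

CroppedObjects[cutout_filename] = cutout.data

As far as I know, Cutout2D does not necessarily make a copy of the data, so you are effectively still keeping a reference to image_data, which is mmap'd.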
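The effect of a lingering view on a memory map can be illustrated with the standard library alone. This is an analogy under stated assumptions, not Astropy's internal mechanism: a plain mmap object and a memoryview stand in for the memory-mapped FITS array and the cutout, and the names mapped, whole, and tile are made up for the illustration.

```python
import mmap
import tempfile

with tempfile.TemporaryFile() as fh:
    fh.write(b"\x00" * 4096)
    fh.flush()

    mapped = mmap.mmap(fh.fileno(), 0)  # map the whole file
    whole = memoryview(mapped)          # stands in for the full image array
    tile = whole[:16]                   # a view, like a cutout without a copy

    try:
        mapped.close()                  # refused while views still exist
        close_blocked = False
    except BufferError:
        close_blocked = True

    # Once every view is released, the mapping can be closed.
    tile.release()
    whole.release()
    mapped.close()
    closed_after_release = mapped.closed

print("close blocked while view alive:", close_blocked)
print("closed after releasing views:", closed_after_release)
```

In the same spirit, every cutout stored in CroppedObjects as a view would keep its underlying mapping, and with it an open file, alive.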

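Solution: don't use fits.getdata here. See the warning about this in the docs:

These functions are useful for interactive Python sessions and simple analysis scripts, but should not be used for application code, as they are highly inefficient. For example, each call to getval() requires re-parsing the entire FITS file. Code that makes repeated use of these functions should instead open the file with open() and access the data structures directly.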

So you want to replace the line:

image_data = fits.getdata(current_image_path)

with

image_data = hdul[1].data

As @Christoph wrote in his answer, get rid of all the del image_data and gc.collect() stuff, since it is not going to help you anyway.

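Addendum: from the API documentation for Cutout2D:

If False (default), then the cutout data will be a view into the original data array. If True, then the cutout data will hold a copy of the original data array.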

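So this states explicitly (and I confirmed it by peeking at the code) that Cutout2D just takes a view of the original data array, meaning it holds a reference to it. You can avoid this if you want by calling Cutout2D(..., copy=True). If you do that, you could probably also drop memmap=False. Using mmap may or may not be useful: it depends in part on the size of the images and how much physical RAM you have available. In your case it might be faster, since you are not using the entire images but only taking cutouts of them. This means memmap=True may be more efficient, because it can avoid paging the entire image array into memory.

But this can also depend on a lot of things, so you might want to run some performance tests with fits.open(..., memmap=False) + Cutout2D(..., copy=False) versus fits.open(..., memmap=True) + Cutout2D(..., copy=True) on a smaller number of files.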
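The view-versus-copy distinction from the Cutout2D documentation can be checked with plain NumPy. This is a small sketch in which an ordinary in-memory array stands in for a FITS image; the names image, tile_view, and tile_copy are made up for the illustration.

```python
import numpy as np

# A 100 x 100 float64 array stands in for a full image (80,000 bytes).
image = np.arange(100 * 100, dtype=np.float64).reshape(100, 100)

tile_view = image[10:20, 10:20]         # slicing returns a view
tile_copy = image[10:20, 10:20].copy()  # an independent 10 x 10 array

# The view shares its buffer with the full image, so keeping the view
# keeps the whole image buffer alive; the copy owns just 800 bytes.
print("view shares memory with image:", np.shares_memory(tile_view, image))
print("copy shares memory with image:", np.shares_memory(tile_copy, image))
print("bytes kept alive by view:", tile_view.base.nbytes)
print("bytes owned by copy:", tile_copy.nbytes)
```

With about 3000 images of more than 80 MB each, storing views rather than copies in a dictionary would retain every full image, not just the tiles.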

OSError 24 when opening FITS files with Astropy
