Question

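From the German weather service DWD you can download csv-like files containing historical rainfall on a high-resolution grid (see everything here, e.g. https://opendata.dwd.de/climate_environment/CDC/grids_germany/hourly/radolan/historical/asc/). Each file can be loaded into Python as a simple DataFrame with: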

df = pd.read_csv(file_location, delimiter=' ', skiprows=6, header=None, usecols=range(900), na_values=[-1])

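In my application I want to look at various cases over a period of time (e.g. 2 years). The problem is that to do so I need to load every df into memory but only access a single value in each. This leads either to heavy RAM usage (if I keep all files in memory) or to a large number of file reads (if I load a file each time I access it). To overcome this and make parallel computation easier, I would like to extract, for a given row-column combination, the list of values across all DataFrames. Unfortunately, I could not find an example on SO or elsewhere of how to do this efficiently. A simplified example is given below: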

import pandas as pd
import numpy as np

dfs = []
for i in range(1000):
    dfs.append(pd.DataFrame(data=np.random.rand(900,900)))


for row in range(900):
    for column in range(900):
        # TODO: efficiently extract the values at [row, column] from every df and save them to a file
        pass

Thanks a lot for your help!

Answer 1

Can the data files be converted to a binary format? In a simple test:

np.loadtxt() imported a 900x900 array in 0.264 seconds (text format)
np.load() imported the binary version in 0.0012 seconds (binary format)
a 217x speedup

Parquet and Feather (see the pandas docs) are other high-performance storage options, and dask may help with managing the computation.

import numpy as np
from pathlib import Path
from time import perf_counter

data_dir = Path('../../../Downloads/RW-201912') 
text_file = 'RW_20191231-2350.asc'
bin_file = 'test.npy'

# 1. read text file
start = perf_counter()
with open(data_dir / text_file, 'rt') as handle:
    x = np.loadtxt(handle, skiprows=6)
elapsed = perf_counter() - start
print(x.shape, round(elapsed, 4))

# 2. write binary file
start = perf_counter()
with open(data_dir / bin_file, 'wb') as fp:
    # int8 only holds -128..127; int16 is used here in case grid values exceed that range
    np.save(fp, x.astype(np.int16))
elapsed = perf_counter() - start
print(round(elapsed, 4))

# 3. read binary file
start = perf_counter()
with open(data_dir / bin_file, 'rb') as handle:
    y = np.load(handle)
elapsed = perf_counter() - start
print(y.shape, round(elapsed, 4))

(900, 900) 0.2661 # read text file
0.0026            # write binary file
(900, 900) 0.0009 # read binary file
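
Building on the binary format, here is a minimal sketch of how the per-cell extraction from the question might look. It assumes all grids have the same shape and writes them into one 3D .npy file that is then read through a memory map, so only the requested values are pulled from disk. The helper names build_stack and cell_series, the file name stack.npy and the small random demo grids are hypothetical and only serve as illustration.

```python
import numpy as np
from pathlib import Path
from tempfile import TemporaryDirectory


def build_stack(grids, n_grids, grid_shape, out_path, dtype=np.float32):
    """Write equally shaped 2D grids into one (time, row, column) .npy file."""
    # open_memmap creates a regular .npy file that is filled slice by slice,
    # so the complete stack never has to be held in RAM.
    stack = np.lib.format.open_memmap(
        out_path, mode='w+', dtype=dtype, shape=(n_grids, *grid_shape)
    )
    for t, grid in enumerate(grids):
        stack[t] = grid
    stack.flush()
    del stack  # release the memory map


def cell_series(stack_path, row, column):
    """Return all values stored at (row, column), one per time step."""
    stack = np.load(stack_path, mmap_mode='r')  # memory-mapped, not loaded fully
    return np.array(stack[:, row, column])      # copy only the requested values


# Demo with small random grids standing in for the real rainfall files.
rng = np.random.default_rng(0)
n_steps, grid_shape = 20, (30, 30)
grids = [rng.random(grid_shape) for _ in range(n_steps)]

with TemporaryDirectory() as tmp:
    stack_path = Path(tmp) / 'stack.npy'
    build_stack(iter(grids), n_steps, grid_shape, stack_path, dtype=np.float64)
    series = cell_series(stack_path, row=3, column=7)

print(series.shape)  # one value per time step
```

With the (time, row, column) layout, the values of one cell are spread across the file. If the access pattern is mostly per cell, storing the stack as (row, column, time) instead may make each read contiguous, at the cost of slower writes when building the file.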
