Question

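I am currently doing some memory-intensive text processing, for which I have to build a sparse matrix of float32 values with dimensions of roughly (2M, 5M). I build this matrix column by column while reading a corpus of 5M documents, using SciPy's sparse dok_matrix data structure. However, on reaching the 500,000th document my memory is full (about 30 GB in use) and the program crashes. What I ultimately want to do is run a dimensionality reduction algorithm on the matrix with sklearn, but, as described, it is impossible to construct and hold the entire matrix in memory. I have looked into numpy.memmap, since sklearn supports it, and tried to memmap some of the underlying numpy data structures of the SciPy sparse matrix, but I could not get this to work.

I cannot save the whole matrix in dense format, since that would require 40 TB of disk space. So I think HDF5 and PyTables are not an option for me (?).

My question now is: how can I build a sparse matrix on the fly, writing directly to disk instead of to memory, so that I can use it in sklearn afterwards?

Thanks!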

Answer 1

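We have run into similar problems in the field of single-cell genomics, where we deal with large sparse datasets on disk. I'll show you a small example of how I would handle this. My assumption is that you are very memory constrained and probably cannot fit multiple copies of the sparse matrix in memory at once. This approach works even if you cannot fit a single complete copy.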

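I will construct an on-disk sparse CSC matrix column by column. A sparse CSC matrix uses three underlying arrays:

data: the values stored in the matrix
indices: the row index of each value in the matrix
indptr: an array of length n_cols + 1 that divides indices and data according to the column they belong to.

As an illustration, the values of column i are stored in data[indptr[i]:indptr[i+1]]. Likewise, the row indices of those values are found at indices[indptr[i]:indptr[i+1]].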
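As a minimal sketch of this layout, the three arrays can be inspected on a small in-memory matrix. The 3x3 matrix and its values are made up purely for illustration:

```python
import numpy as np
from scipy import sparse

# Hypothetical 3x3 example matrix, chosen only to illustrate the layout.
dense = np.array([[1, 0, 0],
                  [0, 0, 3],
                  [2, 0, 4]], dtype=np.float32)
mtx = sparse.csc_matrix(dense)

# Column i occupies the slice indptr[i]:indptr[i+1] of both data and indices.
i = 2
start, stop = mtx.indptr[i], mtx.indptr[i + 1]
col_values = mtx.data[start:stop]    # non-zero values of column i
col_rows = mtx.indices[start:stop]   # row index of each of those values
```

Note that an empty column (column 1 here) simply has indptr[i] == indptr[i+1], so it costs one indptr entry and nothing else.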

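To simulate your data-generating process (parsing a document, I assume), I'll define a function process_document that returns the values of indices and data for the relevant document.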

import numpy as np
import h5py
from scipy import sparse

from tqdm import tqdm  # For monitoring the writing process
from typing import Tuple, Union  # Just for argument annotation

def process_document():
    """
    Simulate processing a document. Results in sparse vector representation.
    """
    n_items = np.random.negative_binomial(2, .0001)
    indices = np.random.choice(2_000_000, n_items, replace=False)
    indices.sort()
    data = np.random.random(n_items).astype(np.float32)
    return indices, data

def data_generator(n):
    """Iterator which yields simulated data."""
    for i in range(n):
        yield process_document()

Now I'll create a group in an hdf5 file that will store the constituent arrays of the sparse matrix.

def make_sparse_csc_group(f: Union[h5py.File, h5py.Group], groupname: str, shape: Tuple[int, int]):
    """
    Create a group in an hdf5 file that can store a CSC sparse matrix.
    """
    g = f.create_group(groupname)
    g.attrs["shape"] = shape
    g.create_dataset("indices", shape=(1,), dtype=np.int64, chunks=True, maxshape=(None,))
    g["indptr"] = np.zeros(shape[1] + 1, dtype=int) # We want this to have a zero for the first value
    g.create_dataset("data", shape=(1,), dtype=np.float32, chunks=True, maxshape=(None,))
    return g

And finally a function that reads this group back as a sparse matrix (this one is pretty simple).

def read_sparse_csc_group(g: Union[h5py.File, h5py.Group]):
    # Read the three arrays fully into memory before building the matrix.
    return sparse.csc_matrix((g["data"][:], g["indices"][:], g["indptr"][:]), shape=tuple(g.attrs["shape"]))

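Now we'll create the on-disk sparse matrix and write to it one column at a time (I'm using fewer columns here, since this can be kind of slow).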

N_COLS = 10

def make_disk_matrix(f, groupname, data_iter, shape):
    group = make_sparse_csc_group(f, groupname, shape)  # use the requested group name

    indptr = group["indptr"]
    data = group["data"]
    indices = group["indices"]
    n_total = 0

    for doc_num, (cur_indices, cur_data) in enumerate(tqdm(data_iter)):
        n_cur = len(cur_indices)
        n_prev = n_total
        n_total += n_cur
        indices.resize((n_total,))
        data.resize((n_total,))
        indices[n_prev:] = cur_indices
        data[n_prev:] = cur_data
        indptr[doc_num+1] = n_total

# Writing
with h5py.File("data.h5", "w") as f:
    make_disk_matrix(f, "mtx", data_generator(N_COLS), (2_000_000, N_COLS))

# Reading
with h5py.File("data.h5", "r") as f:
    mtx = read_sparse_csc_group(f["mtx"])

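Again, this addresses the very memory-constrained case, where you may not be able to fit the whole sparse matrix in memory while creating it. If you can hold the entire sparse matrix plus at least one copy, a much faster approach is to skip disk storage altogether (similar to the other suggestions). In that case, a slight modification of the code above should give you better performance: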

def make_memory_mtx(data_iter, shape):
    indices_list = []
    data_list = []
    indptr = np.zeros(shape[1]+1, dtype=int)
    n_total = 0

    for doc_num, (cur_indices, cur_data) in enumerate(data_iter):
        n_cur = len(cur_indices)
        n_prev = n_total
        n_total += n_cur
        indices_list.append(cur_indices)
        data_list.append(cur_data)
        indptr[doc_num+1] = n_total

    indices = np.concatenate(indices_list)
    data = np.concatenate(data_list)

    return sparse.csc_matrix((data, indices, indptr), shape=shape)

mtx = make_memory_mtx(data_generator(10), shape=(2_000_000, 10))

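This should be fairly fast, since it only copies the data once, when the arrays are concatenated. The other solutions currently posted reallocate their arrays during processing and therefore make many copies of large arrays.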

Answer 2

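It would be great if you could provide a minimal working example. I cannot tell whether your matrix becomes too large because of the way it is constructed (1) or because you simply have too much data (2). If you do not particularly care about building this matrix yourself, you can go straight to my remark 2.

For problem (1), in the example code below I made a wrapper class that builds a csr_matrix chunk by chunk. The idea is to append (row, column, data) tuples to lists until a buffer limit is reached (see remark 1), and only then actually update the matrix. When the limit is reached, the amount of data in memory shrinks, because the csr_matrix constructor sums entries that share the same (row, column) tuple. This part only lets you construct the sparse matrix quickly (much faster than creating a sparse matrix for each row) and avoids memory errors caused by redundant (row, column) entries when a word appears several times in a document.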

import numpy as np
import scipy.sparse

class SparseMatrixBuilder():
    def __init__(self, shape, build_size_limit):
        self.sparse_matrix = scipy.sparse.csr_matrix(shape)
        self.shape = shape
        self.build_size_limit = build_size_limit
        self.data_temp = []
        self.col_indices_temp = []
        self.row_indices_temp = []

    def _flush(self):
        # Merge the buffered entries into the matrix; nothing to do if the buffer is empty.
        if not self.data_temp:
            return
        # csr_matrix expects (data, (row_indices, col_indices)) and sums duplicates.
        self.sparse_matrix = self.sparse_matrix + scipy.sparse.csr_matrix(
            (np.concatenate(self.data_temp),
             (np.concatenate(self.row_indices_temp),
              np.concatenate(self.col_indices_temp))),
            shape=self.shape
        )
        self.data_temp = []
        self.col_indices_temp = []
        self.row_indices_temp = []

    def add(self, data, col_indices, row_indices):
        self.data_temp.append(data)
        self.col_indices_temp.append(col_indices)
        self.row_indices_temp.append(row_indices)
        # The limit counts buffered chunks, not bytes (see remark 1).
        if len(self.data_temp) >= self.build_size_limit:
            self._flush()

    def get_matrix(self):
        self._flush()
        return self.sparse_matrix
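
The duplicate-summing behaviour that the builder relies on can be checked in isolation. This is a small sketch with made-up entries, in which the same (row, column) pair occurs twice:

```python
import numpy as np
import scipy.sparse

# Hypothetical entries: the pair (row 1, column 0) occurs twice.
data = np.array([1.0, 1.0, 1.0])
rows = np.array([1, 1, 2])
cols = np.array([0, 0, 0])

m = scipy.sparse.csr_matrix((data, (rows, cols)), shape=(3, 1))

# The two (1, 0) entries are summed into a single stored value.
n_stored = m.nnz
summed_value = m[1, 0]
```

Three input entries therefore end up as two stored values, which is why flushing the buffer reduces memory use when words repeat.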

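For problem (2), you can easily extend this class with a save method that stores the matrix on disk once the limit (or a second limit) is reached. That way you end up with multiple sparse matrix chunks on disk. You will then need a dimensionality reduction algorithm that can handle chunked matrices (see remark 2).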
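
One way to sketch such a save step is with scipy.sparse.save_npz and scipy.sparse.load_npz. The helper names save_chunk and iter_chunks and the file naming scheme are hypothetical and not part of the class above:

```python
import os
import tempfile

import scipy.sparse


def save_chunk(matrix, directory, chunk_id):
    """Write one sparse chunk to disk (hypothetical helper) and return its path."""
    path = os.path.join(directory, f"chunk_{chunk_id:05d}.npz")
    scipy.sparse.save_npz(path, matrix.tocsr())
    return path


def iter_chunks(paths):
    """Yield the stored chunks one at a time, so only one is in memory."""
    for path in paths:
        yield scipy.sparse.load_npz(path)


# Usage with two small made-up chunks of the same shape.
shape = (4, 3)
chunks = [
    scipy.sparse.csr_matrix(([1.0, 2.0], ([0, 3], [1, 2])), shape=shape),
    scipy.sparse.csr_matrix(([5.0], ([2], [0])), shape=shape),
]
with tempfile.TemporaryDirectory() as tmp:
    paths = [save_chunk(c, tmp, i) for i, c in enumerate(chunks)]
    total_nnz = sum(chunk.nnz for chunk in iter_chunks(paths))
```

In a real run the builder would call something like save_chunk instead of keeping the accumulated matrix, then start again from an empty matrix.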

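Remark 1: the buffer limit here is not well defined. It would be better to check the actual size of the numpy arrays in data_temp, col_indices_temp and row_indices_temp against the RAM available on the machine (this is easy to automate in Python).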
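
A rough sketch of such a size check, using the nbytes attribute of numpy arrays. The function name buffered_nbytes and the MEMORY_BUDGET value are assumptions for illustration; the budget would have to be tuned to the machine:

```python
import numpy as np


def buffered_nbytes(*buffers):
    """Total bytes held by the numpy arrays in several lists of arrays."""
    return sum(arr.nbytes for buf in buffers for arr in buf)


# Made-up buffers standing in for data_temp, col_indices_temp and row_indices_temp.
data_temp = [np.ones(1000, dtype=np.float32)]
col_indices_temp = [np.zeros(1000, dtype=np.int64)]
row_indices_temp = [np.arange(1000, dtype=np.int64)]

MEMORY_BUDGET = 1 << 20  # assumed budget of 1 MiB, not a recommendation
used = buffered_nbytes(data_temp, col_indices_temp, row_indices_temp)
should_flush = used >= MEMORY_BUDGET
```

The builder could then flush when should_flush is true instead of counting buffered chunks.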

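Remark 2: gensim is a Python library whose great advantage is that it builds NLP models from chunked files. You can therefore build a dictionary, construct a sparse matrix, and reduce its dimensionality with that library without needing much RAM.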

Answer 3

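I am assuming that your data can fit in memory when stored in a more memory-friendly sparse matrix format such as COO. If that is not the case, there is little hope that you will be able to proceed, even with sklearn and mmap. Indeed, sklearn is likely to create subsequent objects whose memory requirements are of the same order of magnitude as your input.

SciPy's dok_matrix is actually a subclass of the vanilla dict. It stores data using individual Python objects and lots of pointers, so it is not memory efficient. A much more compact representation is the coo_matrix format. You can incrementally build the data needed to create a COO matrix by preallocating arrays for the coordinates (rows and columns) and for the data, and growing these buffers later if your initial guess turns out to be too small.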


import numpy
from scipy.sparse import coo_matrix


def get_coo_from_iter(iterable, n_data_hint=1<<20, idx_dtype='uint32', data_dtype='float32'):
    counter = 0
    rows = numpy.empty(n_data_hint, dtype=idx_dtype)
    cols = numpy.empty(n_data_hint, dtype=idx_dtype)
    data = numpy.empty(n_data_hint, dtype=data_dtype)
    for row, col, value in iterable:
        if counter >= n_data_hint:
            n_data_hint *= 2
            rows, cols, data = _reallocate(rows, cols, data, n_data_hint)
        rows[counter] = row
        cols[counter] = col
        data[counter] = value
        counter += 1
    rows = rows[:counter]
    cols = cols[:counter]
    data = data[:counter]
    return coo_matrix((data, (rows, cols)))


def _reallocate(rows, cols, data, n):
    new_rows = numpy.empty(n, dtype=rows.dtype)
    new_cols = numpy.empty(n, dtype=cols.dtype)
    new_data = numpy.empty(n, dtype=data.dtype)
    new_rows[:rows.size] = rows
    new_cols[:cols.size] = cols
    new_data[:data.size] = data
    return new_rows, new_cols, new_data

You can test it with randomly generated data as follows:

def get_random_data(n, max_row=2000, max_col=5000):
    for _ in range(n):
        row = numpy.random.choice(max_row)
        col = numpy.random.choice(max_col)
        val = numpy.random.randn()
        yield row, col, val

# test when initial hint is good
coo = get_coo_from_iter(get_random_data(10000), n_data_hint=10000)
print(coo.shape)

# or to test when initial hint was too tiny
coo = get_coo_from_iter(get_random_data(10000), n_data_hint=1111)
print(coo.shape)

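Once you have a COO matrix, you may want to convert it to CSR with coo.tocsr(). CSR matrices are better optimized for common operations such as dot products. CSR requires a bit more memory in cases where some rows are empty, because it stores a pointer for every row, including the empty ones.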
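
A small sketch of that conversion on a made-up 4x3 matrix in which two rows are empty. The values are illustrative only:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Made-up 4x3 matrix in which rows 1 and 2 are empty.
rows = np.array([0, 0, 3])
cols = np.array([0, 2, 1])
data = np.array([1.0, 2.0, 3.0], dtype=np.float32)
coo = coo_matrix((data, (rows, cols)), shape=(4, 3))

csr = coo.tocsr()
# indptr has one entry per row plus one, including the empty rows.
row_pointers = csr.indptr
# Row sums computed as a matrix-vector product.
y = csr.dot(np.ones(3, dtype=np.float32))
```

The repeated values in row_pointers mark the empty rows, which is the per-row overhead mentioned above.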

Dynamically constructing a sparse matrix on disk in Python
