Question

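I have several files of about 100 MB each. The files have the following format:

0  1  2  5  8  67  9  122
1  4  5  2  5  8
0  2  1  5  6
.....

(Note that the actual files contain no alignment spaces; elements are separated by a single space. The alignment is added here only for readability.)

The first element of each line is the binary class label, and the rest of the line lists the indices of the features whose value is 1. For example, the third line says that the second, first, fifth, and sixth features of that row are 1 and all the others are zero.

I tried reading each line of each file and building a sparse matrix with sparse.coo_matrix.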

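But this takes forever to finish. I started reading the data in the evening and left the computer running while I slept, and when I woke up it still had not finished the first file!

What is a better way to process this kind of data?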

Answer 1

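I think this will be somewhat faster than your approach because it does not read the file line by line. You can try this code on a small portion of one file and compare it with your own code. It also requires knowing the number of features in advance. If the number of features is not known, one extra line of code is needed, which is included below but commented out.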

import pandas as pd
from scipy.sparse import lil_matrix
from functools import partial


def writeMx(result, row):
    # Feature numbers are 1-based, so subtract 1 for the zero-based matrix.
    # The NaN padding makes the values floats, so cast to int before indexing.
    col_ind = row.dropna().values.astype(int) - 1
    # Assign values without duplicating row index and values
    result[row.name, col_ind] = 1


def fileToMx(f):
    # number of features
    col_n = 136
    # Short lines are padded with NaN up to the given number of columns
    df = pd.read_csv(f, names=list(range(0, col_n + 2)), sep=' ')
    # This is the label of the binary classification
    label = df.pop(0)
    # Or get the number of features with the line below,
    # but it would not be the same across different files
    # col_n = int(df.max().max())
    # Number of rows
    row_n = len(label)
    # Generate feature matrix for one file
    result = lil_matrix((row_n, col_n))
    # Save features in matrix, one DataFrame row at a time (axis=1)
    # DataFrame.apply() is usually faster than normal looping
    df.apply(partial(writeMx, result), axis=1)
    return result

# train_files is assumed to be the list of input file paths
for train in train_files:
    # result is the sparse matrix you can further save or use
    result = fileToMx(train)
    # Print the shape of the matrix and the number of nonzero values
    print(result.shape, result.nnz)
    # ((420, 136), 15)
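
As a possible alternative, here is a minimal sketch (not part of the code above) that streams over the lines once and fills the CSR arrays directly, so there is no NaN-padded DataFrame and no per-row assignment into a lil_matrix. The function name lines_to_csr and its n_features parameter are hypothetical. Feature numbers are assumed to be 1-based, as in the question's example, and 136 is simply reused from col_n above.

```python
import io

import numpy as np
from scipy.sparse import csr_matrix


def lines_to_csr(lines, n_features):
    """Return (labels, CSR matrix) from lines of the form 'label idx idx ...'.

    Assumption: feature numbers in the file are 1-based.
    """
    labels = []
    indices = []   # column index of every nonzero, all rows concatenated
    indptr = [0]   # indptr[i]:indptr[i + 1] delimits row i inside `indices`
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        labels.append(int(fields[0]))
        # The set drops feature numbers repeated within one row
        cols = sorted({int(tok) - 1 for tok in fields[1:]})
        indices.extend(cols)
        indptr.append(len(indices))
    data = np.ones(len(indices), dtype=np.int8)
    X = csr_matrix((data, indices, indptr), shape=(len(labels), n_features))
    return np.array(labels), X


# The three sample lines from the question, used as an in-memory file
sample = io.StringIO("0 1 2 5 8 67 9 122\n1 4 5 2 5 8\n0 2 1 5 6\n")
y, X = lines_to_csr(sample, n_features=136)
print(X.shape, X.nnz)
```

For a real file, an open file object can be passed in place of the StringIO, since the function only iterates over lines. Whether this is faster than the pandas version on your data is worth measuring on a small portion of one file first.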
