Question

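I tried searching for this, but I think my question is a bit more complicated than a simple "do it this way". I am looking for ways to optimize the following problem: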

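I have text files containing N rows (hundreds of millions) of data and a few columns. The catch is that, for some reason, column 1 holds an index while the other columns hold values, like this: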

1  2.3  4.7
2  2.8  2.4
1  1.9  3.1
2  6.7  3.1
... # and so forth (first column = index, thousands of unique indexes)

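So what I want is to read these files and concatenate them, then select all rows for each unique index and put them into a separate vector per column. For the data above that would be: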

# Vector 1
1  2.3  4.7
1  1.9  3.1
... # and so on
# Vector 2
2  2.8  2.4
2  6.7  3.1
... # and so on

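I have a working solution, but it takes a lot of time, so I am looking for ways to improve it, hence the title (it is an indexing problem). I am open to solutions using any package, but I guess pandas is a good candidate. Below is my current code (the relevant parts).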

import pandas as pd

# Load data (`files` is the list of input file paths)
data = pd.concat([pd.read_csv(f, sep=r'\t', header=None, engine='python') for f in files])
# Sort data
for col in columns:
    d_dict[name][col] = [data[col][data[0] == i] for i in range(min,max+1)] # range min/max is the min/max of possible index values in column 1

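Both loading the data and sorting it take a lot of time, but this formats the data the way I want, and I think it also preserves the original order of the rows as they were loaded (please tell me if this assumption is wrong :p).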

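I hope you have some good ideas on how to speed this up, since it currently takes about 40 minutes, and this is only a sample of the data I have to process. The final dataset will be about 10 times larger. However, I am using less than 20% of my system memory, so I have room to work with there (though I can dump some data if needed). I could also consider parallelizing it.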

Cheers!

Answer 1

You can argsort the first column and use the result to fancy-index the other columns.

Since your indices are integers that are not too large, we can use a trick to obtain the argsort in what I believe is O(n).

>>> from scipy import sparse
>>> import numpy as np
>>> 
# mock first column
>>> idx = np.random.randint(5_000, 15_000, (50_000_000,))
>>> 
# construct sparse one-hot matrix and convert from csr to csc
# for this conversion scipy must stably argsort the column indices
# but because it can exploit certain properties of the index set
# this is faster than using argsort directly
>>> imn, imx = idx.min(), idx.max()+1
>>> rng = np.arange(idx.size + 1)
>>> spM = sparse.csr_matrix((rng[:-1], idx-imn, rng), (idx.size, imx-imn)).tocsc()
>>> 
# extract the sorting index and the group boundaries
>>> sidx, bnds = spM.indices, spM.indptr
>>>
# use them to extract the groups, here we are using the first column
# itself as an example, the result will - sanity check - be groups
# consisting of copies of the group id 
# in practice, you would use another column in place of `idx` below
>>> groups = np.split(idx[sidx], bnds[1:-1])
>>> groups
# [array([5000, 5000, 5000, ..., 5000, 5000, 5000]), array([5001, 5001, 5001, ..., 5001, 5001, 5001]), array([5002, 5002, 5002, ..., 
#
# ... VERY long list
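
To make the "use another column in place of `idx`" step concrete, here is a minimal sketch on the four example rows from the question. It swaps the sparse-matrix trick for a plain stable `np.argsort`, which is simpler but O(n log n), so it illustrates the grouping rather than the speed-up. The names `data`, `col1_groups`, `col2_groups` and `unique_ids` are made up for this example.

```python
import numpy as np

# Toy data mirroring the question: column 0 is the index, columns 1-2 are values.
data = np.array([
    [1, 2.3, 4.7],
    [2, 2.8, 2.4],
    [1, 1.9, 3.1],
    [2, 6.7, 3.1],
])

idx = data[:, 0].astype(np.int64)

# A stable sort keeps rows that share an index in their original file order.
sidx = np.argsort(idx, kind="stable")

# Group boundaries: positions where the sorted index changes value.
sorted_idx = idx[sidx]
bnds = np.flatnonzero(np.diff(sorted_idx)) + 1

# One array per unique index, for each value column.
col1_groups = np.split(data[sidx, 1], bnds)
col2_groups = np.split(data[sidx, 2], bnds)

# The index value that each group belongs to.
unique_ids = sorted_idx[np.r_[0, bnds]]
```

Note that this variant only produces groups for index values that actually occur, whereas the sparse-matrix version above yields an (empty) group for every integer between the minimum and maximum index.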

Most efficient way to extract values/rows based on an index in Python (native, pandas, numpy)?
