Question

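I have a very large DataFrame (about 1.1M rows) and I am trying to sample from it.

I have a list of indices (about 70,000 of them) that I want to select from the full DataFrame.

This is what I have tried so far, but all of these methods take too much time:

Method 1 - using pandas: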

sample = pandas.read_csv("data.csv", index_col = 0).reset_index()
sample = sample[sample['Id'].isin(sample_index_array)]

Method 2:

I tried writing all the sampled rows to another CSV.

f = open("data.csv", 'r')

out = open("sampled_date.csv", 'w')
out.write(f.readline())  # copy the header row

total = 0
while 1:
    total += 1
    line = f.readline().strip()

    if line == '':
        break
    arr = line.split(",")

    if (int(arr[0]) in sample_index_array):
        out.write(line + "\n")  # write the matching row unchanged

f.close()
out.close()

Can anyone suggest a better approach? Or how can I modify this to make it faster?

Thanks

Answer 1

We don't have your data, so here is an example with two options:

After reading: use a pandas Index object to select a subset via the .iloc selection method
While reading: use a predicate with the skiprows parameter
Given

A set of indices and a (large) sample DataFrame written to test.csv:

import pandas as pd
import numpy as np

indices = [1, 2, 3, 10, 20, 30, 67, 78, 900, 2176, 78776]

df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list("ABCD"))
df.to_csv("test.csv", header=False)
df.info()

Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 4 columns):
A    1000000 non-null int32
B    1000000 non-null int32
C    1000000 non-null int32
D    1000000 non-null int32
dtypes: int32(4)
memory usage: 15.3 MB

Code

Option 1 - After reading

Convert the sample list of indices to an Index object and slice the loaded DataFrame:

idxs = pd.Index(indices)
subset = df.iloc[idxs, :]
print(subset)

The .iat and .at methods are even faster, but they require scalar indices.
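Note that .iloc selects by position, which matches the labels here only because test.csv has a default RangeIndex. When the sampled values are labels in a column such as the question's 'Id', a label-based selection is needed instead. The following is a minimal sketch under that assumption; the small frame, the column name "A", and the `sample_ids` list are hypothetical stand-ins for the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: an "Id" column plus one value column.
df = pd.DataFrame({"Id": np.arange(100, 110), "A": np.arange(10) * 2})
sample_ids = [101, 105, 109, 999]  # 999 is deliberately absent from the frame

# Boolean mask: keeps the frame's original row order and ignores missing Ids.
by_mask = df[df["Id"].isin(sample_ids)]

# Label lookup: index by Id, then keep only the Ids that actually exist.
indexed = df.set_index("Id")
by_label = indexed.loc[indexed.index.intersection(sample_ids)]
```

Both forms select the same rows. Taking the intersection first avoids a KeyError for sampled Ids that are not present in the frame.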

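Option 2 - While reading (recommended)

We can write a predicate that keeps only the selected indices while the file is being read (more efficient):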

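# skiprows calls pred with each row number; returning True skips that row
pred = lambda x: x not in indices
# names must be list-like in recent pandas versions, so pass list("ABCD")
data = pd.read_csv("test.csv", skiprows=pred, index_col=0, names=list("ABCD"))
print(data)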

See also the issue that led to extending skiprows.
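With about 70,000 indices, as in the question, the membership test inside the predicate runs once per row, and testing against a list is linear in the list length. A set is one way to make that test constant-time. The sketch below is a suggestion layered on the answer, not part of it; the small frame, the temporary file path, and the "Id" column name are hypothetical:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small hypothetical frame written without a header, mirroring test.csv.
df = pd.DataFrame(np.arange(40).reshape(10, 4), columns=list("ABCD"))
path = os.path.join(tempfile.mkdtemp(), "small_test.csv")
df.to_csv(path, header=False)

wanted = {1, 3, 7}  # a set gives constant-time membership tests per row

# skiprows receives the 0-based row number and returns True to skip it.
# With no header line, the row number equals the original index label.
data = pd.read_csv(
    path,
    header=None,
    names=["Id", "A", "B", "C", "D"],
    index_col="Id",
    skiprows=lambda row: row not in wanted,
)
```

If the file does have a header line, the row numbers shift by one, so the predicate would need to keep row 0 and compare `row - 1` against the set.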

Result

The latter option produces the same output:

        A   B   C   D
1      74  95  28   4
2      87   3  49  94
3      53  54  34  97
10     58  41  48  15
20     86  20  92  11
30     36  59  22   5
67     49  23  86  63
78     98  63  60  75
900    26  11  71  85
2176   12  73  58  91
78776  42  30  97  96

Fast sampling of a large number of rows from a large DataFrame in Python
