Question

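I am reading a large CSV file in chunks because I do not have enough memory to hold it all at once. I want to read its first 10 rows (rows 0 to 9), skip the next 10 (rows 10 to 19), read the next 10 (rows 20 to 29), skip the next 10 again (rows 30 to 39), then read rows 40 to 49, and so on. Here is the code I am using: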

import pandas as pd

# initializing the n1 and n2 variables
n1 = 1
n2 = 2
# reading data in chunks
for chunk in pd.read_csv('../input/train.csv', chunksize=10, dtype=dtypes, skiprows=list(range(((n1*10)+1), ((n2*10)+1)))):
    sample_chunk = chunk
    # displaying the sample_chunk
    print(sample_chunk)
    # incrementing n1
    n1 = n1 + 2
    # incrementing n2
    n2 = n2 + 2

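However, the code does not work the way I intended. It only skips rows 10 to 19 (that is, it reads rows 0 to 9, skips 10 to 19, reads 20 to 29, then also reads 30 to 39, then 40 to 49, and goes on to read every remaining row). Please help me figure out what I am doing wrong.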

Answer 1

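With your approach, you need to define every row to skip up front, when you first call pd.read_csv:

rowskips = [i for x in range(1,int(lengthOfFile/10),2) for i in range(x*10, (x+1)*10)]

where lengthOfFile is the number of rows in the file. The values are 0-indexed line numbers in the file, so if line 0 holds a header, add 1 to each value to line up with the data rows, as the code in the question does.

Then pass the list to pd.read_csv: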

pd.read_csv('../input/train.csv',chunksize=10, dtype=dtypes,skiprows=rowskips)
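The snippet above leaves lengthOfFile undefined. Here is a minimal sketch of one way to obtain it without loading the file into memory, assuming a single header line and no quoted fields with embedded newlines. The helper names count_data_rows and build_skiprows are made up for illustration and are not part of pandas.

```python
def count_data_rows(path, has_header=True):
    """Count data rows by streaming the file one line at a time."""
    # Binary mode avoids decoding errors; lines are split on b"\n".
    with open(path, "rb") as handle:
        total = sum(1 for _ in handle)
    return max(total - 1, 0) if has_header else total


def build_skiprows(n_rows, block_size=10, has_header=True):
    """Return 0-indexed file line numbers for every second block of data rows."""
    # With a header on file line 0, data row k sits on file line k + 1.
    offset = 1 if has_header else 0
    return [
        row + offset
        for start in range(block_size, n_rows, 2 * block_size)
        for row in range(start, min(start + block_size, n_rows))
    ]
```

The result could then be passed as skiprows=build_skiprows(count_data_rows('../input/train.csv')).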

From the documentation:

skiprows : list-like, int or callable, optional

    Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

    If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].

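So you can pass a list, an int, or a callable:

int -> skips the given number of lines at the start of the file
list -> skips the line numbers given in the list
callable -> evaluates each line number with the callable and then decides whether to skip it

You are passing a list, which fixes the lines to skip once, when the call is made. It cannot be updated afterwards, so changing n1 and n2 inside the loop has no effect. Another approach is to pass a callable such as lambda x: x in rowskips, which is evaluated for each row to decide whether it should be skipped.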
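The callable form also makes it possible to avoid building the list at all, since the skip decision can be computed from the line number alone. A rough sketch, assuming the file has a header on line 0; skip_alternate_blocks and read_alternate_blocks are hypothetical names, and only pd.read_csv with its chunksize and skiprows parameters comes from pandas.

```python
def skip_alternate_blocks(line_number, block_size=10, has_header=True):
    """Return True for file lines that fall in every second block of data rows."""
    if has_header:
        if line_number == 0:
            return False  # always keep the header line
        line_number -= 1  # convert file line number to data row index
    return (line_number // block_size) % 2 == 1


def read_alternate_blocks(path, block_size=10, **read_csv_kwargs):
    """Yield DataFrames holding rows 0-9, 20-29, 40-49, ... (for block_size=10)."""
    import pandas as pd  # imported here so the predicate above has no dependencies

    reader = pd.read_csv(
        path,
        chunksize=block_size,
        skiprows=lambda i: skip_alternate_blocks(i, block_size),
        **read_csv_kwargs,
    )
    yield from reader
```

Because the callable needs no knowledge of the file length, this variant does not require a separate pass to count lines.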

Answer 2

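Code:

# block boundaries: 0, 10, 20, ..., covering at least lengthOfFile rows
ro = list(range(0, lengthOfFile + 10, 10))
# skip every second block; j + 1 shifts past the header on file line 0
# (len(ro) - 1 keeps ro[i + 1] in range when the number of blocks is odd)
d = [j + 1 for i in range(1, len(ro) - 1, 2) for j in range(ro[i], ro[i + 1])]
# print(ro)
print(d)

pd.read_csv('../input/train.csv',chunksize=10, dtype=dtypes,skiprows=d)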

For example (shown here without the + 1 header offset, so the values are 0-indexed data rows):

lengthOfFile = 100
ro = list(range(0, lengthOfFile + 10, 10))
d = [j for i in range(1, len(ro) - 1, 2) for j in range(ro[i], ro[i + 1])]
print(d)

Output: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
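A different technique from both answers, sketched here only as an alternative: leave skiprows out, read every chunk, and simply discard every second one. This still parses the skipped rows, so it is slower, but only one chunk is held in memory at a time and no file length is needed. The helper name every_other is hypothetical.

```python
from itertools import islice


def every_other(chunks):
    """Return an iterator over the 1st, 3rd, 5th, ... item of `chunks`."""
    return islice(chunks, 0, None, 2)


# Assumed usage with pandas (path and dtypes as in the question):
#   for chunk in every_other(pd.read_csv('../input/train.csv', chunksize=10, dtype=dtypes)):
#       print(chunk)
```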
