Question

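I have a very large CSV file (5 GB), so I don't want to load the whole thing into memory, and I want to drop one or more of its columns. I tried the following code in blaze, but all it did was append the resulting columns to the existing CSV file: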

from blaze import Data, odo
d = Data("myfile.csv")
d = d[columns_I_want_to_keep]
odo(d, "myfile.csv")

Is there a way, using either pandas or blaze, to keep only the columns I want and delete the others?

Answer 1

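You can use dask.dataframe, which is syntactically similar to pandas but performs its operations out of core, so memory shouldn't be an issue. It also parallelizes the work automatically, so it should be fast.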

import dask.dataframe as dd

df = dd.read_csv('myfile.csv', usecols=['col1', 'col2', 'col3'])
df.to_csv('output.csv', index=False)

Timings

I timed each of the methods posted so far on a 1.4 GB CSV file. I kept four columns, which left the output CSV file at 250 MB.

Using Dask:

%%timeit
df = dd.read_csv(f_in, usecols=cols_to_keep)
df.to_csv(f_out, index=False)

1 loop, best of 3: 41.8 s per loop

Using pandas:

%%timeit
chunksize = 10**5
for chunk in pd.read_csv(f_in, chunksize=chunksize, usecols=cols_to_keep):
    chunk.to_csv(f_out, mode='a', index=False)

1 loop, best of 3: 44.2 s per loop

Using Python/csv:

%%timeit
inc_f = open(f_in, 'r')
csv_r = csv.reader(inc_f)
out_f = open(f_out, 'w')
csv_w = csv.writer(out_f, delimiter=',', lineterminator='\n')
for row in csv_r:
    new_row = [row[1], row[5], row[6], row[8]]
    csv_w.writerow(new_row)
inc_f.close()
out_f.close()

1 loop, best of 3:  1min 1s per loop

Answer 2

I would do it this way:

cols2keep = ['col1','col3','col4','col6'] # columns you want to have in the resulting CSV file
chunksize = 10**5  # you may want to adjust it ... 
for chunk in pd.read_csv(filename, chunksize=chunksize, usecols=cols2keep):
    chunk.to_csv('output.csv', mode='a', index=False)

PS: if it suits your use case, you may also want to consider migrating from CSV to PyTables (HDF5)...

Answer 3

I frequently deal with large CSV files. Here is my solution:

import csv
fname_in = r'C:\mydir\myfile_in.csv' 
fname_out = r'C:\mydir\myfile_out.csv' 
inc_f = open(fname_in,'r')  #open the file for reading
csv_r = csv.reader(inc_f) # Attach the csv "lens" to the input stream - default is excel dialect
out_f = open(fname_out,'w') #open the file for writing
csv_w = csv.writer(out_f, delimiter=',',lineterminator='\n' ) #attach the csv "lens" to the stream headed to the output file
for row in csv_r: #Loop Through each row in the input file
    new_row = row[:]  # initialize the output row
    new_row.pop(5) #Whatever column you wanted to delete
    csv_w.writerow(new_row) 
inc_f.close()
out_f.close()
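
The same row-by-row approach can also select columns by header name rather than by position. The following is only a sketch under a few assumptions: the first row of the file is a header, every row has as many fields as the header, and the function name `drop_columns` and its parameters are hypothetical, chosen for illustration.

```python
import csv


def drop_columns(src, dst, drop):
    """Stream src to dst one row at a time, omitting the columns named in `drop`."""
    # newline="" lets the csv module handle line endings and quoted newlines itself.
    with open(src, newline="") as f_in, open(dst, "w", newline="") as f_out:
        reader = csv.reader(f_in)
        writer = csv.writer(f_out, lineterminator="\n")

        header = next(reader, None)
        if header is None:  # empty input file: nothing to copy
            return

        # Positions of the columns that survive, computed once from the header.
        keep = [i for i, name in enumerate(header) if name not in drop]

        writer.writerow([header[i] for i in keep])
        for row in reader:
            writer.writerow([row[i] for i in keep])
```

A call such as `drop_columns('myfile_in.csv', 'myfile_out.csv', {'col2', 'col5'})` holds only one row in memory at a time. Looking columns up by name means the code keeps working if the column order in the file changes.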

Answer 4

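Reading the original CSV in chunks and appending each chunk to a new file writes the header again every time a chunk is saved to disk. This can be avoided by writing the header only for the first chunk.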
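
A minimal sketch of one way to do this with pandas, assuming the header should be written only together with the first chunk. The function name `keep_columns` and its parameters are illustrative and do not come from the answers above.

```python
import pandas as pd


def keep_columns(src, dst, columns, chunksize=10**5):
    """Copy only `columns` from src to dst, reading src in chunks."""
    reader = pd.read_csv(src, usecols=columns, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        first = i == 0
        chunk.to_csv(
            dst,
            mode="w" if first else "a",  # overwrite on the first chunk, append afterwards
            header=first,                # write the header row only once
            index=False,
        )
```

Opening the output in write mode for the first chunk also means that rerunning the function replaces an existing output file instead of appending to it.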


Deleting columns from a very large CSV file using pandas or blaze

4 answers: