Question

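I have a number (about 60) of huge (> 2 GB) CSV files that I want to loop over to make subselections (e.g. each file contains one month of data on various financial products, and I want to build a 60-month time series for each product).

Reading an entire file into memory (e.g. by loading it in Excel or MATLAB) is unworkable, so my initial search on Stack Overflow led me to try Python. My strategy is to iterate over each line and write it to a file in some folder. This strategy works fine, but it is extremely slow.

From my understanding there is a trade-off between memory usage and computation speed. Loading the entire file into memory is one end of the spectrum (the computer crashes); loading a single line into memory at a time is obviously the other end (computation time is about 5 hours).

So my main question is: *Is there a way to load multiple lines into memory, so as to do this process faster (100 times?), while not losing functionality?* And if so, how would I implement this? Or am I going about this all wrong? Mind you, below is just simplified code of what I am trying to do (I might want to make subselections on other dimensions than time). Assume that the original data files have no meaningful ordering (other than being split into 60 files, one per month).

The method I am trying in particular is: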

# Creates a time series per bond
import csv
import linecache

# 'allBonds.txt' holds one row of comma-separated bond identifiers per month
# There are 60 large files named financialData_<month><year>.txt

months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
years = ['08', '09', '10', '11', '12']

# Build the 60 input file names, year by year
filedoc = []
for year in years:
    for month in months:
        filedoc.append('financialData_' + month + year + '.txt')

for x in range(0, 60):
    # linecache line numbers start at 1 (line 0 returns ''), so month x is on line x + 1
    line = linecache.getline('allBonds.txt', x + 1)
    bonds = line.strip().split(',')  # generate the identifiers for this particular month
    with open(filedoc[x]) as text_file:
        for line in text_file:
            # drop the trailing newline so it does not end up inside the last field
            temp = line.rstrip('\n').split(';')
            if temp[2] in bonds:  # checks if the bond of this iteration is among those we search for
                output_file = open('monthOutput' + str(temp[2]) + str(filedoc[x]) + '.txt', 'a')
                datawriter = csv.writer(output_file, dialect='excel', delimiter='^', quoting=csv.QUOTE_MINIMAL)
                datawriter.writerow(temp)
                output_file.close()

Thanks in advance.

P.S. Just to make sure: the code currently works (though any suggestions are welcome, of course), but the issue is speed.

Answer 1

I would test pandas.read_csv, mentioned in https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file. It supports reading the file in chunks (see its chunksize and iterator options).
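A minimal sketch of chunked reading with pandas, in case it helps. The function name `filter_in_chunks`, the default chunk size, and the assumptions that the input has no header row and that the bond identifier sits in the third `;`-separated column are mine (taken from the question's `temp[2]`), so adjust them to the real data.

```python
import pandas as pd


def filter_in_chunks(path, bonds, out_path, chunksize=100_000):
    """Append rows whose third column is in `bonds` to `out_path`.

    Returns the number of rows written.
    """
    wanted = set(bonds)
    rows_written = 0
    # Assumptions: no header row, ';' as separator, all fields kept as text.
    reader = pd.read_csv(path, sep=';', header=None, dtype=str,
                         chunksize=chunksize)
    with open(out_path, 'a', newline='') as out:
        for chunk in reader:  # each chunk is a DataFrame of <= chunksize rows
            hits = chunk[chunk[2].isin(wanted)]
            hits.to_csv(out, sep='^', header=False, index=False)
            rows_written += len(hits)
    return rows_written
```

Only one chunk is held in memory at a time, so `chunksize` is the knob for the memory versus speed trade-off described in the question. To get one output file per bond, `hits.groupby(2)` could be used inside the loop instead of writing everything to a single file.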

I think this part of the code can cause serious performance problems if the condition matches frequently.

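    if temp[2] in bonds:  # checks if the bond of this iteration is among those we search for
        output_file = open('monthOutput' + str(temp[2]) + str(filedoc[x]) + '.txt', 'a')
        datawriter = csv.writer(output_file, dialect='excel', delimiter='^', quoting=csv.QUOTE_MINIMAL)
        datawriter.writerow(temp)
        output_file.close()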

It is better to avoid opening the file, creating a csv.writer() object, and then closing the file inside the loop.
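One possible way to do that, as a rough sketch: open each output file once, keep its csv.writer in a dictionary keyed by bond identifier, and close everything after the input file has been read. The function name `split_by_bond` and its parameters are hypothetical, and the sketch assumes the number of distinct bonds per month stays below the operating system's limit on open files.

```python
import csv


def split_by_bond(in_path, bonds, out_prefix, out_suffix='.txt'):
    """Write each matching row to a per-bond file, opening each file once.

    Returns the sorted list of bond identifiers that received rows.
    """
    wanted = set(bonds)   # set lookup is much faster than scanning a list
    files = {}            # bond identifier -> open file object
    writers = {}          # bond identifier -> csv.writer for that file
    try:
        with open(in_path) as text_file:
            for line in text_file:
                row = line.rstrip('\n').split(';')
                if len(row) < 3 or row[2] not in wanted:
                    continue
                bond = row[2]
                writer = writers.get(bond)
                if writer is None:
                    # First row for this bond: open its file and cache the writer
                    f = open(out_prefix + bond + out_suffix, 'a', newline='')
                    files[bond] = f
                    writer = csv.writer(f, dialect='excel', delimiter='^',
                                        quoting=csv.QUOTE_MINIMAL)
                    writers[bond] = writer
                writer.writerow(row)
    finally:
        for f in files.values():
            f.close()
    return sorted(writers)
```

In the question's loop this would be called once per month, for example `split_by_bond(filedoc[x], bonds, 'monthOutput', filedoc[x] + '.txt')`. If there are too many bonds to keep all files open at once, buffering rows per bond in lists and flushing them periodically would be an alternative.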
