Question

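I am analysing a large dataset and have run into a problem with processing time (15 days!). Most of that time is spent importing data from text files. I think this part can be optimized, but I don't know how. This is how I currently import the data: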

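import json
import time

def lines(filename):
    """Count newline characters by reading the file in 1 MiB chunks."""
    lines = 0
    buf_size = 1024 * 1024
    with open(filename) as f:
        read_f = f.read  # loop optimization
        buf = read_f(buf_size)
        while buf:
            lines += buf.count('\n')
            buf = read_f(buf_size)
    return lines

def importdata(x, y, filename2):
    ts = time.time()
    nfiles = lines(filename2) + 1  # first full pass: newline count + 1
    counter = 0
    search = '[' + str(x) + ', ' + str(y) + ']'
    output_list = []
    print("test", time.time() - ts)
    ts = time.time()
    with open(filename2) as fobj:  # second full pass: look for the key line
        for counter, line in enumerate(fobj, 1):
            if search in line:
                # the line after a matching key line holds the data
                coordinate = json.loads(next(fobj))
                output_list.extend(coordinate)
    # counter equals nfiles only when no key matched
    # (assumes the file has no trailing newline)
    if counter == nfiles:
        output = "Bad pixel"
    else:
        output = output_list
    print("time", time.time() - ts)
    return output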

My files have the following format:

[12, 512]
[51, 64, 85, 12, 23]
[13, 45]
[83, 27, 28, 19, 3]
[17, 35]
[54, 78, 38, 19, 2]
[12, 512]
[23, 65, 6, 5, 45]

For example, running importdata(12, 512, filename2) should return

[51,64,85,12,23,23,65,6,5,45]

If x, y is not in the file, the output should be "Bad pixel".
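To pin down the expected behaviour, here is a minimal single-pass sketch. It assumes that key lines and data lines strictly alternate and that every non-blank line is valid JSON. It compares parsed values rather than text, so spacing inside the brackets does not matter. The name `import_data_single_pass` is hypothetical and only used for illustration; parsing every key line may well be slower than a substring test.

```python
import json

def import_data_single_pass(x, y, filename):
    """Collect the data lines that follow every [x, y] key line, in one pass."""
    target = [x, y]
    output_list = []
    found = False
    with open(filename) as fobj:
        for key_line in fobj:
            if not key_line.strip():
                continue                    # skip blank lines
            data_line = next(fobj, None)    # the data line of this record
            if data_line is None:
                break                       # dangling key line at end of file
            if json.loads(key_line) == target:
                output_list.extend(json.loads(data_line))
                found = True
    return output_list if found else "Bad pixel"
```

Unlike the two-pass version, this does not need the line count to detect a missing pixel.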

Each file is about 1 GB, and I read the output in the following order:

x, y, file 1
x, y, file 2
x, y, file 3
x2, y2, file 1
x2, y2, file 2
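With this access pattern the same file is scanned again for every (x, y) pair. One possible direction is to parse each file once into a dictionary and then answer all lookups from memory. The sketch below makes the same file-format assumptions as above (alternating key and data lines, valid JSON, key lines with exactly two values); `build_index` and `lookup` are hypothetical names, and it is an untested assumption that the index for a 1 GB file fits in RAM.

```python
import json
from collections import defaultdict

def build_index(filename):
    """Parse the file once into {(x, y): [values, ...]}."""
    index = defaultdict(list)
    with open(filename) as fobj:
        for key_line in fobj:
            if not key_line.strip():
                continue                    # skip blank lines
            data_line = next(fobj, None)    # the data line of this record
            if data_line is None:
                break                       # dangling key line at end of file
            x, y = json.loads(key_line)
            index[(x, y)].extend(json.loads(data_line))
    return index

def lookup(index, x, y):
    """Return the values stored for (x, y), or "Bad pixel" if absent."""
    # .get does not create an empty entry in the defaultdict on a miss
    values = index.get((x, y))
    return values if values is not None else "Bad pixel"
```

The index would be built once per file and then reused for every pixel, for example `index = build_index(filename)` followed by `lookup(index, 12, 512)`.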

Thank you very much!

Importing data from a large file into a list

0 answers: