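Processing a huge .csv file in Python

Time: 2015-03-13 20:31:06

Tags: python csv

As part of a larger problem I am working on, I have to read in a set of .csv files, manipulate them, and generate a new set of .csv files. Everything goes smoothly except for one file: voltvalues.csv. The contents of that file look like this: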

...
13986513,6,6/1/2014 12:00:00 AM,248.7
13986513,6,6/1/2014 12:00:05 AM,248.4
13986513,6,6/1/2014 12:00:10 AM,249
13986513,6,6/1/2014 12:00:15 AM,249.3
13986513,6,6/1/2014 12:00:20 AM,249.3
13986513,6,6/1/2014 12:00:25 AM,249.3
...
13986513,6,6/30/2014 11:55:00 PM,249.3
13986534,6,6/1/2014 12:00:00 AM,249
13986534,6,6/1/2014 12:00:05 AM,249
13986534,6,6/1/2014 12:00:10 AM,249.3
13986534,6,6/1/2014 12:00:15 AM,249.6
...
13986534,6,6/30/2014 11:55:00 PM,249.7
...

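I am trying to produce another .csv file, newvolt.csv, containing the data in the following format:

timestamp,13986513,13986534,...
2014-06-01 12:00:00 PDT,248.7,249.3,...
2014-06-01 12:00:05 PDT,248.4,249,...
...
2014-06-30 23:55:00 PDT,249.3,249.7,...

The problem with this file is the size of voltvalues.csv: 6 GB (a massive 1 billion rows and 4 columns). So the way I am reading it is like this: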

import csv

#meters=[]
real_recorder = open("newvolt.csv", 'w')
with open("voltvalues.csv", newline='') as voltfile:
    voltread = csv.reader(voltfile)
    next(voltread)  # skip header
    for line in voltread:
        # convert the data of voltvalues.csv into the format I desire
        # BEST WAY to do it?
        real_recorder.writelines([...])
        #meters.append(line[0])
real_recorder.close()
#print(len(meters))
#print(len(set(meters)))

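I know Python's datetime module has methods for converting one datetime format into another, but in this case it is very expensive in terms of memory. Any suggestions on the best way to do the whole conversion?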
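
For reference, the per-row timestamp conversion itself can be done one value at a time, so that it adds almost nothing to memory use. The following is only a minimal sketch: it assumes the input timestamps look like the sample above (6/1/2014 12:00:00 AM) and that the zone label is appended as plain text. The name convert_timestamp is a hypothetical helper, not something from the original code.

```python
from datetime import datetime

def convert_timestamp(raw, zone="PDT"):
    """Convert '6/1/2014 12:00:05 AM' to '2014-06-01 00:00:05 PDT'.

    The zone label is appended as plain text (an assumption); no real
    time-zone arithmetic is performed.
    """
    parsed = datetime.strptime(raw, "%m/%d/%Y %I:%M:%S %p")
    return parsed.strftime("%Y-%m-%d %H:%M:%S") + " " + zone

print(convert_timestamp("6/30/2014 11:55:00 PM"))
```

Note that with the 12-hour %I and %p directives, 12:00:00 AM is parsed as midnight, so it comes out as 00:00:00 in the 24-hour output.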

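2 Answers:

Answer 0: (score: 1)

You can scan the file and record the starting offset of each sensor. To read the next value for a given sensor, seek to that offset, read one line, and update the offset. With this approach you don't need to keep much data in local memory, but you rely on the operating system's RAM cache for performance. This might be a good place to use a memory-mapped file.

It gets more complicated if the sensors don't all have the same time values, but here is a start: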

import datetime

# Small sample file. The header names are placeholders; the question
# only says that the real file has a header row.
with open('data.csv', 'w') as samplefp:
    samplefp.write(
"""\
meter,channel,timestamp,volts
13986513,6,6/1/2014 12:00:00 AM,248.7
13986513,6,6/1/2014 12:00:05 AM,248.4
13986513,6,6/1/2014 12:00:10 AM,249
13986513,6,6/1/2014 12:00:15 AM,249.3
13986513,6,6/1/2014 12:00:20 AM,249.3
13986513,6,6/1/2014 12:00:25 AM,249.3
13986513,6,6/30/2014 11:55:00 PM,249.3
13986534,6,6/1/2014 12:00:00 AM,249
13986534,6,6/1/2014 12:00:05 AM,249
13986534,6,6/1/2014 12:00:10 AM,249.3
13986534,6,6/1/2014 12:00:15 AM,249.6
13986534,6,6/30/2014 11:55:00 PM,249.7\
""")

class ReadSensorLines(object):

    def __init__(self, filename):

        sensor_offsets = {}  # sensor id -> [offset of next unread line, bytes left]
        sensors = []         # sensor ids in file order

        # binary mode, so that tell() and seek() work on plain byte offsets
        readfp = open(filename, "rb")
        readfp.readline() # skip header

        # find start of each sensor
        # use readline not iteration so that tell offset is right

        offset = readfp.tell()  # start of the current sensor's block
        sensor = ''

        while True:
            line_start = readfp.tell()
            line = readfp.readline()
            if not line:
                break
            next_sensor = line.split(b',', 1)[0].decode('ascii')
            if next_sensor != sensor:
                if sensor:
                    # the previous sensor's block ends where this line starts
                    sensors.append(sensor)
                    sensor_offsets[sensor] = [offset, line_start - offset]
                sensor = next_sensor
                offset = line_start
        if sensor:
            # the last sensor's block runs to the end of the file
            sensors.append(sensor)
            sensor_offsets[sensor] = [offset, readfp.tell() - offset]

        self.readfp = readfp
        self.sensor_offsets = sensor_offsets
        self.sensors = sensors

    def read_sensor(self, sensorname):
        # return the sensor's next line, or '' once its block is used up
        pos_data = self.sensor_offsets[sensorname]
        self.readfp.seek(pos_data[0])
        line = self.readfp.readline(pos_data[1])
        pos_data[0] += len(line)
        pos_data[1] -= len(line)
        return line.decode('ascii')

    @property
    def data_remains(self):
        return any(pos_data[1] for pos_data in self.sensor_offsets.values())

    def close(self):
        self.readfp.close()


sensor_lines = ReadSensorLines("data.csv")
print(','.join(['timestamp'] + sensor_lines.sensors))
while sensor_lines.data_remains:
    row = []
    for sensor in sensor_lines.sensors:
        sensor_line = sensor_lines.read_sensor(sensor)
        if sensor_line:
            _, _, date, volts = sensor_line.strip().split(',')
            row.append(volts)
        else:
            row.append('')
    # rows are matched by position, so the timestamp comes from the
    # last sensor that still had data in this pass
    row.insert(0, str(datetime.datetime.strptime(date, '%m/%d/%Y %I:%M:%S %p')))
    print(','.join(row))
sensor_lines.close()
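
The memory-mapped file mentioned at the start of this answer could be used for the indexing pass. The following is a rough sketch rather than a tuned implementation: it assumes one header line and that each sensor's rows are contiguous, as in the sample data. The function name index_sensor_offsets and the header names in the sample are made up for illustration.

```python
import mmap
import os
import tempfile

def index_sensor_offsets(path):
    """Return {sensor_id: [start, length]} byte ranges, using a memory map.

    Assumes one header line and that each sensor's rows are contiguous.
    """
    offsets = {}
    with open(path, "rb") as fp:
        with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            mm.readline()  # skip header
            while True:
                start = mm.tell()
                line = mm.readline()
                if not line:
                    break
                sensor = line.split(b",", 1)[0].decode("ascii")
                if sensor in offsets:
                    offsets[sensor][1] += len(line)  # extend the block
                else:
                    offsets[sensor] = [start, len(line)]  # new block
    return offsets

sample = (
    "meter,channel,timestamp,volts\n"  # header names are placeholders
    "13986513,6,6/1/2014 12:00:00 AM,248.7\n"
    "13986513,6,6/1/2014 12:00:05 AM,248.4\n"
    "13986534,6,6/1/2014 12:00:00 AM,249\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w", newline="") as f:
        f.write(sample)
    sensor_offsets = index_sensor_offsets(path)
print(sensor_offsets)
```

With a memory map the operating system pages the file in and out on demand, so the process does not hold the whole 6 GB in its own buffers. The index that is kept in memory is only one small entry per sensor.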

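Answer 1: (score: 0)

It seems that the information in the file is already sorted in the right order. What you can do is:

Write the header "timestamp,13986513,13986534,..." to the file "timestamp.txt". Then hold the value "13986513,6,6/1/2014 12:00:00" in a global string. Then write a line to the file "volt.csv".

Each time "13986513,6,6/1/2014 12:00:00" matches, you can append the value to the previously created line "2014-06-01 12:00:00 PDT,248.7,249.3,...".

But each time, you read one line and write that line out. If you keep everything in memory, the program will not be able to handle it.

Take a look at the flush() function. I think you might need it.

Edit:

Example code: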

class Ice():
    def __init__(self):
        self.Fire()

    def Fire(self):
        # Open both files once; the with statement closes them automatically.
        with open('File1.txt', 'r') as file1, open('File2.txt', 'a') as file2:
            for line in file1:
                # Do something.
                print('Do something....')
                # Save to file.
                file2.write(line)
                file2.flush()

# Run class
Ice()
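
The thing with big files is that they use a lot of memory. So what you want is to read one line from the file, process it, write it out (getting it out of memory), and take the next line. That way you can handle huge files.

What .flush() does is flush the output. In my example you write a line, but Python does not necessarily write it to the file at the moment of write(); it holds it in an in-memory buffer. By calling .flush(), the buffered output is written to the file, so the line does not stay in Python's buffer.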
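
The buffering described here can be observed with a small experiment. This is only an illustration, with a made-up file name and row. Python's write buffer is bounded in size, so calling flush() mainly controls when the data reaches the file, not how much memory the program uses overall.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.txt")
    with open(path, "w") as out:
        out.write("13986513,248.7\n")
        size_before = os.path.getsize(path)  # data may still sit in Python's buffer
        out.flush()                          # hand the buffered bytes to the OS
        size_after = os.path.getsize(path)

print(size_before, size_after)
```

On a typical setup the first number is smaller than the second, because the short line is still waiting in the buffer until flush() is called.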

By creating temporary files, you can process all of the rows without Python hitting its memory limit.
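
One possible reading of the temporary-file idea is sketched below. It is not the answerer's exact method. It assumes each meter's rows are contiguous in the input and that all meters share the same timestamps in the same order. The function name pivot_with_temp_files, the .col file names, and the header names in the sample are hypothetical, and the time-zone label is left out for brevity.

```python
import csv
import os
import tempfile
from datetime import datetime
from itertools import zip_longest

def pivot_with_temp_files(src_path, dst_path, tmp_dir):
    """Pivot 'meter,channel,timestamp,volts' rows into one column per meter."""
    meters = []   # meter ids in order of first appearance
    out = None    # temp file for the meter currently being read
    stamp_path = os.path.join(tmp_dir, "timestamps.col")

    # Pass 1: stream the input once, spilling each meter's column to disk.
    with open(src_path, newline="") as src, open(stamp_path, "w") as stamps:
        reader = csv.reader(src)
        next(reader)  # skip header
        for meter, _channel, stamp, volts in reader:
            if not meters or meter != meters[-1]:
                if out is not None:
                    out.close()
                meters.append(meter)
                out = open(os.path.join(tmp_dir, meter + ".col"), "w")
            if len(meters) == 1:
                # timestamps are taken from the first meter only
                parsed = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
                stamps.write(parsed.strftime("%Y-%m-%d %H:%M:%S") + "\n")
            out.write(volts + "\n")
    if out is not None:
        out.close()

    # Pass 2: read all column files in lockstep, one line from each per row.
    paths = [stamp_path] + [os.path.join(tmp_dir, m + ".col") for m in meters]
    columns = [open(p) for p in paths]
    try:
        with open(dst_path, "w", newline="") as dst:
            writer = csv.writer(dst)
            writer.writerow(["timestamp"] + meters)
            for values in zip_longest(*columns, fillvalue=""):
                writer.writerow([v.rstrip("\n") for v in values])
    finally:
        for col in columns:
            col.close()

sample = (
    "meter,channel,timestamp,volts\n"  # header names are placeholders
    "13986513,6,6/1/2014 12:00:00 AM,248.7\n"
    "13986513,6,6/1/2014 12:00:05 AM,248.4\n"
    "13986534,6,6/1/2014 12:00:00 AM,249\n"
    "13986534,6,6/1/2014 12:00:05 AM,249\n"
)
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "voltvalues.csv")
    dst = os.path.join(tmp, "newvolt.csv")
    with open(src, "w", newline="") as f:
        f.write(sample)
    pivot_with_temp_files(src, dst, tmp)
    with open(dst, newline="") as f:
        result = f.read()
print(result)
```

Only one input row and one output row are held in memory at any time. The trade-off is that the second pass keeps one file open per meter, so a very large number of meters could run into the operating system's limit on open files.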