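As part of a larger problem I'm working on, I have to read in a set of .csv files and manipulate them to generate a new set of .csv files. Everything is going smoothly except for one file: voltvalues.csv. Its contents look like this: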
...
13986513,6,6/1/2014 12:00:00 AM,248.7
13986513,6,6/1/2014 12:00:05 AM,248.4
13986513,6,6/1/2014 12:00:10 AM,249
13986513,6,6/1/2014 12:00:15 AM,249.3
13986513,6,6/1/2014 12:00:20 AM,249.3
13986513,6,6/1/2014 12:00:25 AM,249.3
...
13986513,6,6/30/2014 11:55:00 PM,249.3
13986534,6,6/1/2014 12:00:00 AM,249
13986534,6,6/1/2014 12:00:05 AM,249
13986534,6,6/1/2014 12:00:10 AM,249.3
13986534,6,6/1/2014 12:00:15 AM,249.6
...
13986534,6,6/30/2014 11:55:00 PM,249.7
...
I'm trying to produce another .csv file, newvolt.csv, containing the data in the following format:
timestamp,13986513,13986534,...
2014-06-01 12:00:00 PDT,248.7,249.3,...
2014-06-01 12:00:05 PDT,248.4,249,...
...
2014-06-30 23:55:00 PDT,249.3,249.7,...
The problem with this file is the size of voltvalues.csv: 6 GB (about 1 billion rows and 4 columns). So the way I'm reading it is like this:
import csv

#meters=[]
real_recorder = open("newvolt.csv", 'w')
with open("voltvalues.csv", newline='') as voltfile:
    voltread = csv.reader(voltfile)
    next(voltread)  # skip header
    for line in voltread:
        #convert the data of voltvalues.csv into the format I desire
        #BEST WAY to do it?
        real_recorder.writelines([...])
        #meters.append(line[0])
real_recorder.close()
#print(len(meters))
#print(len(set(meters)))
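I know Python's datetime module has methods for converting one datetime format into another, but in this case that is very expensive in terms of memory. Any suggestions on the best way to do the whole conversion?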
Answer 0 (score: 1)
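You can scan the file and record the starting offset of each sensor's block. To read the next value for a given sensor, seek to its offset, read one line, and update the offset. With this approach you don't need to hold much data in local memory, but you are relying on the operating system's RAM cache for performance. This may be a good place to use a memory-mapped file.
If the sensors don't all have the same time values it gets more complicated, but here's a start: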
import datetime

# Write a small sample file to test against. The header row is a
# hypothetical stand-in; the real column names are not shown in the question.
with open('data.csv', 'w') as f:
    f.write(
"""\
meter,channel,timestamp,volts
13986513,6,6/1/2014 12:00:00 AM,248.7
13986513,6,6/1/2014 12:00:05 AM,248.4
13986513,6,6/1/2014 12:00:10 AM,249
13986513,6,6/1/2014 12:00:15 AM,249.3
13986513,6,6/1/2014 12:00:20 AM,249.3
13986513,6,6/1/2014 12:00:25 AM,249.3
13986513,6,6/30/2014 11:55:00 PM,249.3
13986534,6,6/1/2014 12:00:00 AM,249
13986534,6,6/1/2014 12:00:05 AM,249
13986534,6,6/1/2014 12:00:10 AM,249.3
13986534,6,6/1/2014 12:00:15 AM,249.6
13986534,6,6/30/2014 11:55:00 PM,249.7\
""")

class ReadSensorLines(object):

    def __init__(self, filename):
        sensor_offsets = {}  # sensor id -> [offset of next unread line, bytes left]
        sensors = []         # sensor ids in file order
        # binary mode so that tell() and len(line) are both byte counts
        readfp = open(filename, "rb")
        readfp.readline() # skip header
        # find start of each sensor
        # use readline not iteration so that tell offset is right
        offset = readfp.tell()
        sensor = ''
        while True:
            line_start = readfp.tell()
            line = readfp.readline()
            if not line:
                break
            next_sensor = line.split(b',', 1)[0].decode()
            if next_sensor != sensor:
                if sensor:
                    # the previous sensor's block ends where this line starts
                    sensors.append(sensor)
                    sensor_offsets[sensor] = [offset, line_start - offset]
                sensor = next_sensor
                offset = line_start
        # the last sensor's block runs to the end of the file
        if sensor:
            sensors.append(sensor)
            sensor_offsets[sensor] = [offset, readfp.tell() - offset]
        self.readfp = readfp
        self.sensor_offsets = sensor_offsets
        self.sensors = sensors

    def read_sensor(self, sensorname):
        # read the next unread line of this sensor's block ('' once exhausted)
        pos_data = self.sensor_offsets[sensorname]
        self.readfp.seek(pos_data[0])
        line = self.readfp.readline(pos_data[1])
        pos_data[0] += len(line)
        pos_data[1] -= len(line)
        return line.decode()

    @property
    def data_remains(self):
        return any(pos_data[1] for pos_data in self.sensor_offsets.values())

    def close(self):
        self.readfp.close()

sensor_lines = ReadSensorLines("data.csv")
while sensor_lines.data_remains:
    row = []
    for sensor in sensor_lines.sensors:
        sensor_line = sensor_lines.read_sensor(sensor)
        if sensor_line:
            _, _, date, volts = sensor_line.strip().split(',')
            row.append(volts)
        else:
            row.append('')
    row.insert(0, date)
    # the input timestamps use a 12-hour clock with seconds and AM/PM
    row[0] = str(datetime.datetime.strptime(row[0], '%m/%d/%Y %I:%M:%S %p'))
    print(','.join(row))
sensor_lines.close()
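The memory-mapped file idea mentioned in this answer could look roughly like the following. This is only a sketch and not part of the code above: the helper names `index_sensor_blocks` and `iter_rows` are hypothetical, the header names in the sample are made up, and it assumes the file has exactly one header line and is grouped by sensor.

```python
import mmap
import os
import tempfile


def index_sensor_blocks(mm):
    """Return [(sensor_id, start, end)] byte ranges, one per sensor (hypothetical helper)."""
    blocks = []
    size = len(mm)
    pos = mm.find(b"\n") + 1  # skip the header line
    current, start = None, pos
    while pos < size:
        end = mm.find(b"\n", pos)
        end = size if end == -1 else end + 1
        sensor = mm[pos:mm.find(b",", pos)]
        if sensor != current:
            if current is not None:
                blocks.append((current.decode(), start, pos))
            current, start = sensor, pos
        pos = end
    if current is not None:
        blocks.append((current.decode(), start, size))
    return blocks


def iter_rows(mm, blocks):
    """Yield (timestamp, [volts per sensor]), reading one line from each block in turn."""
    cursors = [start for _, start, _ in blocks]
    ends = [end for _, _, end in blocks]
    while any(c < e for c, e in zip(cursors, ends)):
        timestamp, volts = "", []
        for i in range(len(blocks)):
            if cursors[i] >= ends[i]:
                volts.append("")  # this sensor has run out of readings
                continue
            mm.seek(cursors[i])
            line = mm.readline()
            cursors[i] += len(line)
            _, _, stamp, value = line.decode().strip().split(",")
            timestamp = timestamp or stamp
            volts.append(value)
        yield timestamp, volts


# Tiny sample in the same layout as voltvalues.csv (header names are assumptions).
sample = (
    "meter,channel,timestamp,volts\n"
    "13986513,6,6/1/2014 12:00:00 AM,248.7\n"
    "13986513,6,6/1/2014 12:00:05 AM,248.4\n"
    "13986534,6,6/1/2014 12:00:00 AM,249\n"
    "13986534,6,6/1/2014 12:00:05 AM,249\n"
)
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w", newline="") as f:
    f.write(sample)

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    blocks = index_sensor_blocks(mm)
    rows = list(iter_rows(mm, blocks))

for timestamp, volts in rows:
    print(timestamp + "," + ",".join(volts))
```

With a memory map the operating system pages the file in and out as needed, so the script itself only keeps one cursor per sensor. Whether this is actually faster than `seek` plus `readline` on a 6 GB file would need to be measured.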
Answer 1 (score: 0)
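It looks like the information in the file is already sorted in the right order. What you could do is:
Write the header "timestamp,13986513,13986534,..." to the file "timestamp.txt". Then grab the value "13986513,6,6/1/2014 12:00:00" into a global string. Then write a line to the file "volt.csv".
Each time "13986513,6,6/1/2014 12:00:00" matches, you can append the value to the previously created line "2014-06-01 12:00:00 PDT,248.7,249.3,...".
But each time, read one line and write that line out again. If you keep everything in memory, the program won't be able to cope.
Take a look at the flush() function. I think you may need it.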
Edit:
Example code:
class Ice():
    def __init__(self):
        self.Fire()

    def Fire(self):
        # Open both files once; each line is read, handled and written in turn.
        with open('File1.txt', 'r') as file1, open('File2.txt', 'a') as file2:
            for line in file1:
                # Do something.
                print('Do something....')
                # Save to file.
                file2.write(line)
                file2.flush()
        # No close() calls are needed: the with statement closes both files.

# Run class
Ice()
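The thing about large files is that they take up a lot of memory if you load them whole. So what you want is to read one line from the file, process it, write it out (getting it out of memory), and move on to the next line. That way you can handle very large files.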
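For the timestamp conversion alone, that read-process-write loop might look like the sketch below. `convert_timestamps` is a hypothetical helper, the header names are made up, and the input is assumed to have one header line and the 12-hour `month/day/year hour:minute:second AM/PM` layout shown in the question. It does not pivot the data into one column per meter.

```python
import csv
import io
from datetime import datetime


def convert_timestamps(src, dst):
    """Copy CSV rows from src to dst one at a time, rewriting the timestamp column."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(next(reader))  # pass the header through unchanged
    for meter, channel, stamp, volts in reader:
        parsed = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
        writer.writerow([meter, channel, parsed.strftime("%Y-%m-%d %H:%M:%S"), volts])


# In-memory stand-ins for the real files, so the sketch runs on its own.
src = io.StringIO(
    "meter,channel,timestamp,volts\n"
    "13986513,6,6/1/2014 12:00:00 AM,248.7\n"
    "13986513,6,6/30/2014 11:55:00 PM,249.3\n"
)
dst = io.StringIO()
convert_timestamps(src, dst)
print(dst.getvalue())
```

Only one row is held at a time, so memory use stays flat regardless of file size. With real files, `src` and `dst` would be objects returned by `open(..., newline="")`.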
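What .flush() does is flush the output. In my example you write a line, but Python does not necessarily write it to disk at the moment of the write() call; it holds it in an in-memory buffer. Calling .flush() forces the buffered output to be written to the file, so the line no longer sits in the buffer.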
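The buffering can be observed directly by checking the file size before and after the flush. A small sketch, assuming CPython's default buffered file objects; the file name is arbitrary:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "flush_demo.txt")
line = "13986513,6,6/1/2014 12:00:00 AM,248.7\n"

with open(path, "w", newline="") as out:
    out.write(line)
    # Typically still 0 here: the line is sitting in Python's buffer.
    size_before = os.path.getsize(path)
    out.flush()
    # After the flush the whole line has been handed to the operating system.
    size_after = os.path.getsize(path)

print(size_before, size_after)
```

Note that the buffer has a fixed size and is written out automatically whenever it fills up, so a loop that writes line by line uses a bounded amount of memory even without explicit flush() calls.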
By writing to temporary files, you can process all the lines without Python hitting its memory limit.
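One way to read the temporary-file suggestion is to split the input into one temporary file per meter and then read those files in step to build the wide rows. The sketch below follows that interpretation, which is an assumption rather than something the answer spells out. `split_by_meter` and `merge_columns` are hypothetical names, the header names are made up, and every meter is assumed to report the same timestamps in the same order.

```python
import csv
import io
import os
import tempfile
from itertools import zip_longest


def split_by_meter(src, tmpdir):
    """Stream src once, writing 'timestamp,volts' lines to one temp file per meter."""
    paths = {}  # meter id -> temp file path, in first-seen order
    out, current = None, None
    reader = csv.reader(src)
    next(reader)  # skip header
    for meter, _channel, stamp, volts in reader:
        if meter != current:
            if out:
                out.close()
            paths.setdefault(meter, os.path.join(tmpdir, meter + ".csv"))
            out = open(paths[meter], "a", newline="")
            current = meter
        out.write(stamp + "," + volts + "\n")
    if out:
        out.close()
    return paths


def merge_columns(paths, dst):
    """Read every per-meter file in step and write one wide row per timestamp."""
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(["timestamp"] + list(paths))
    files = [open(p, newline="") for p in paths.values()]
    try:
        for lines in zip_longest(*files, fillvalue=""):
            # Take the timestamp from the first meter that still has data.
            stamp = next(l for l in lines if l).split(",")[0]
            volts = [l.strip().split(",")[-1] if l else "" for l in lines]
            writer.writerow([stamp] + volts)
    finally:
        for f in files:
            f.close()


# In-memory stand-ins for voltvalues.csv and newvolt.csv.
src = io.StringIO(
    "meter,channel,timestamp,volts\n"
    "13986513,6,6/1/2014 12:00:00 AM,248.7\n"
    "13986513,6,6/1/2014 12:00:05 AM,248.4\n"
    "13986534,6,6/1/2014 12:00:00 AM,249\n"
    "13986534,6,6/1/2014 12:00:05 AM,249\n"
)
dst = io.StringIO()
with tempfile.TemporaryDirectory() as tmpdir:
    merge_columns(split_by_meter(src, tmpdir), dst)
print(dst.getvalue())
```

The merge step keeps one file open per meter, so with thousands of meters it may run into the operating system's limit on open files; merging in batches would be one way around that.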