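I am trying to join (or merge) two time-indexed DataFrames in pandas, but my code keeps accumulating the tables in memory.
Each file is only 1 MB, yet after a few more files the computer runs out of memory.
How can I do this?
The join method used (left, right, inner, outer...) does not change the performance problem.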
import pandas as pd
from glob import glob

filenames = glob('*.txt')

# Read the first file to initialise the combined table.
filename = filenames[0]
varname = filename[:-11]  # drop the last 11 characters of the file name to get the variable name
print('Sampling', filename)
data = pd.read_csv(filename, sep=';', skiprows=3, names=['time', varname],
                   index_col=0)

# Merge each remaining file into the combined table on the time index.
for filename in filenames[1:]:
    print('Sampling', filename)
    varname = filename[:-11]  # drop the last 11 characters of the file name
    data_new = pd.read_csv(filename, sep=';', skiprows=3, names=['time', varname],
                           index_col=0)
    # data = pd.DataFrame.join(data, data_new, how='outer', on='time')
    data = pd.DataFrame.merge(data_new, data, how='outer', on='time', copy=False)
This is the output of data.head() and data.tail() after running the code on three files.
                        var1     var2  var3     var4     var5
time
01/01/2016 07:00:00  13.3781  6.95406   NaN  87.6588  71.5696
01/01/2016 08:00:00  13.2312  6.89561   NaN  87.6221  71.6038
01/01/2016 09:00:00  13.2774  6.90632   NaN  87.2595  71.4383
01/01/2016 10:00:00  13.6152  7.02360   NaN  87.2028  71.4482
01/01/2016 11:00:00  13.5584  7.00147   NaN  87.3733  71.3335
...
                        var1     var2  var3     var4     var5
time
01/01/2019 02:00:00  15.8096  28.2316   NaN  87.5106  68.6665
01/01/2019 03:00:00  15.8352  28.1616   NaN  87.7226  69.0639
01/01/2019 04:00:00  15.6879  27.6819   NaN  87.1135  68.6873
01/01/2019 05:00:00  15.6558  27.7961   NaN  87.4658  69.1395
01/01/2019 06:00:00  15.7383  28.1330   NaN  87.5775  68.8240
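
One thing that may be worth checking is whether any file contains repeated timestamps. This is only an assumption about the data, not something the output above confirms. An outer merge pairs every row on the left with every row on the right that shares the same key, so repeated keys multiply the row count with each merge. The sketch below uses small synthetic frames in place of the real files, and the helper name combine_on_time is hypothetical. It first shows the row growth from a merge on a duplicated key, then combines the frames in a single pd.concat call after dropping repeated timestamps.

```python
import pandas as pd


def combine_on_time(frames):
    """Combine single-variable frames indexed by time into one wide table.

    Hypothetical helper: repeated timestamps within a frame are dropped
    (first occurrence kept) so that no key can match more than one row.
    """
    deduplicated = [f[~f.index.duplicated(keep="first")] for f in frames]
    # One outer concat aligns all frames on the union of their indexes.
    return pd.concat(deduplicated, axis=1, join="outer").sort_index()


# Synthetic stand-ins for two files; var2 repeats the 07:00 timestamp.
var1 = pd.DataFrame(
    {"var1": [13.3, 13.2, 13.4]},
    index=pd.Index(
        ["01/01/2016 07:00:00", "01/01/2016 08:00:00", "01/01/2016 09:00:00"],
        name="time",
    ),
)
var2 = pd.DataFrame(
    {"var2": [6.9, 6.8, 6.7]},
    index=pd.Index(
        ["01/01/2016 07:00:00", "01/01/2016 07:00:00", "01/01/2016 08:00:00"],
        name="time",
    ),
)

# Row growth: merging two frames that both repeat a key.
left = var2.reset_index()
right = var2.rename(columns={"var2": "var3"}).reset_index()
merged = left.merge(right, how="outer", on="time")
print(len(left), len(right), len(merged))

# Single concat after dropping repeated timestamps.
combined = combine_on_time([var1, var2])
print(combined)
```

Keeping the first occurrence of a repeated timestamp is an arbitrary choice made for the sketch; averaging the repeated rows or keeping the last one may suit the real data better.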