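I am reading several large (~700 MB) CSV files into DataFrames, which will then all be combined into a single CSV. Right now, each CSV is indexed by its date column. The CSVs have overlapping dates but each covers a unique test location, and each file is named after its test location (e.g. ber.csv and alt.csv for the BER and ALT test sites). How can I build a MultiIndex from this? Right now I have: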
import os
import time
from os.path import basename

import pandas as pd

def openFile(filesToProcess):
    df1 = pd.DataFrame()
    counter = 0
    for input in filesToProcess:
        base = os.path.splitext(basename(input))[0]
        print("Working on %s" % base)
        # The with block closes the file, so no explicit close() is needed.
        with open(input, 'r') as input_file:
            #row_count = sum(1 for row in input_file)
            if counter == 0:
                df1 = createDataFrame(input_file)
            else:
                df2 = createDataFrame(input_file)
                df1 = pd.concat([df1, df2])
        counter += 1
    df1.to_csv('large.csv')

def createDataFrame(input_file):
    checkTime = time.perf_counter()
    #print("Start DataFrame -- #%d" % counter)
    df1 = pd.read_csv(input_file,
                      sep=",",
                      nrows=500,
                      index_col=['Date'])
    #print("End DataFrame -- #%d" % counter)
    #print("Ran for " + str(time.perf_counter() - checkTime) + " Seconds")
    return df1
So what I want is:
date, testsite, data1, data2
1/1/1992 9:15:00, ber, 89, 200
1/1/1992 9:17:00, ber, 54, 103.3
1/1/1992 9:15:00, alt, 90, 109.23
1/1/1992 9:17:00, alt, 12, 110.1
where date and testsite together form the MultiIndex.
Answer 0 (score: 0)
Setup
import pandas as pd

# Build two small sample frames and write them out as ber.csv and alt.csv.
ber_df = pd.DataFrame([[89, 200], [54, 103.3]],
                      pd.DatetimeIndex(['1/1/1992 9:15:00', '1/1/1992 9:17:00'],
                                       name='date'),
                      ['data1', 'data2'])
alt_df = pd.DataFrame([[90, 109.23], [12, 110.1]],
                      pd.DatetimeIndex(['1/1/1992 9:15:00', '1/1/1992 9:17:00'],
                                       name='date'),
                      ['data1', 'data2'])
ber_df.to_csv('ber.csv')
alt_df.to_csv('alt.csv')
Solution
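filesToProcess = ['ber.csv', 'alt.csv']

def parse_file(fn):
    # Use the first column as the index and parse it as datetimes.
    return pd.read_csv(fn, index_col=0, parse_dates=[0])

# The dict keys (file names without '.csv') become the outer index level.
pd.concat({fn.replace('.csv', ''): parse_file(fn) for fn in filesToProcess}) \
    .rename_axis(['testsite', 'date'], axis=0).swaplevel(0, 1).reset_index()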
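
Note that the trailing reset_index() turns date and testsite back into ordinary columns. If the goal is to keep them as an actual MultiIndex, a variant along the following lines might work. This is only a sketch under some assumptions: SAMPLE_FILES, site_name and combine_sites are hypothetical names introduced for illustration, and the in-memory strings stand in for the real ber.csv and alt.csv so the example runs on its own.

```python
import io
import os

import pandas as pd

# Hypothetical in-memory stand-ins for ber.csv and alt.csv (same rows as the question).
SAMPLE_FILES = {
    "ber.csv": "date,data1,data2\n1/1/1992 9:15:00,89,200\n1/1/1992 9:17:00,54,103.3\n",
    "alt.csv": "date,data1,data2\n1/1/1992 9:15:00,90,109.23\n1/1/1992 9:17:00,12,110.1\n",
}


def site_name(path):
    """Derive the test-site label from a file name, e.g. 'data/ber.csv' -> 'ber'."""
    return os.path.splitext(os.path.basename(path))[0]


def combine_sites(sources):
    """Concatenate per-site frames into one frame indexed by (date, testsite).

    `sources` maps a file name to anything pd.read_csv accepts
    (a path on disk or a file-like object).
    """
    frames = {
        site_name(name): pd.read_csv(src, index_col=0, parse_dates=[0])
        for name, src in sources.items()
    }
    # The dict keys become the outer level; `names` labels both index levels.
    combined = pd.concat(frames, names=["testsite", "date"])
    # Put date first; both levels stay in the index (no reset_index()).
    return combined.swaplevel("testsite", "date")


combined = combine_sites(
    {name: io.StringIO(text) for name, text in SAMPLE_FILES.items()}
)
print(combined)
```

With real files, passing the paths themselves (for example {fn: fn for fn in filesToProcess}) should work the same way. Calling combined.to_csv('large.csv') writes both index levels as the leading columns, which matches the date, testsite, data1, data2 layout asked for in the question.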