I have a data frame of population figures like the following:
   RegionName     State  2000-01  2000-02  2000-03  2000-04  ...  2016-10  2016-11  2016-12
0  New York       NY     204      300      300      124      ...  456      566      344
1  Mountain View  CA     204      300      300      124      ...  456      566      344
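The dataset has roughly 10K rows. For this dataset, I want to add a column with the average population for each quarter from 2000 to 2016.
I wrote a function and passed it to the data frame's apply, as follows: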
import numpy as np

def quarterize(row):
    quarter_to_months_map = {
        'q1' : ['01', '02', '03'],
        'q2' : ['04', '05', '06'],
        'q3' : ['07', '08', '09'],
        'q4' : ['10', '11', '12']
    }
    for year in range(2000, 2017):
        year = '{}'.format(year)
        for quarter in quarter_to_months_map.keys():
            values = []
            for month in quarter_to_months_map[quarter]:
                values.append(row['{}-{}'.format(year, month)])
            row['{}{}'.format(year, quarter)] = np.nanmean(values)
    return row

df = df.apply(quarterize, axis = 1)
This works fine on smaller datasets, but on the ~10K-row dataset it takes about 10 minutes. Is there a way to make this more efficient and faster?
Answer 0 (score: 1)
Yes. Never operate on rows; operate on columns instead.
Something like:
import numpy as np
import pandas as pd
import random

# Sample data: 1000 rows of random monthly values for 2000 through 2009,
# with columns in chronological order ('2000-01', '2000-02', ...).
df = pd.DataFrame([[random.randint(150, 300) for x in range(12 * 10)] for _ in range(1000)],
                  columns=['{}-{:02d}'.format(year, month) for year in range(2000, 2010) for month in range(1, 13)])

quarter_to_months_map = {
    'q1' : ['01', '02', '03'],
    'q2' : ['04', '05', '06'],
    'q3' : ['07', '08', '09'],
    'q4' : ['10', '11', '12']
}

for year in range(2000, 2010):
    for quarter, months in quarter_to_months_map.items():
        # Column names of the three months in this quarter, e.g. '2000-01'
        months = ['{}-{}'.format(year, month) for month in months]
        # Column-wise mean across those months; NaN values are skipped by default
        df['{}{}'.format(year, quarter)] = df[months].mean(axis=1)
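
The same column-wise idea can also be written without the explicit year and quarter loops, by labelling each month column with its quarter and averaging the columns that share a label. The following is only a sketch under assumptions: the helper name `add_quarter_means` and the small sample frame are made up for illustration, and it assumes every month column is named `YYYY-MM` as in the question.

```python
import numpy as np
import pandas as pd


def add_quarter_means(df, first_year=2000, last_year=2016):
    """Return df with one mean column per quarter appended.

    Hypothetical helper: assumes month columns are named 'YYYY-MM'.
    """
    month_cols = ['{}-{:02d}'.format(year, month)
                  for year in range(first_year, last_year + 1)
                  for month in range(1, 13)]
    # Label every month column with its quarter, e.g. '2000-05' -> '2000q2'.
    labels = np.array(['{}q{}'.format(col[:4], (int(col[5:]) - 1) // 3 + 1)
                       for col in month_cols])
    # Transpose so months become rows, average the rows that share a label
    # (NaN values are skipped by default), then transpose back.
    quarterly = df[month_cols].T.groupby(labels).mean().T
    return pd.concat([df, quarterly], axis=1)


# Tiny made-up example: two regions, one year of monthly values.
months = ['2000-{:02d}'.format(m) for m in range(1, 13)]
sample = pd.DataFrame([list(range(1, 13)), [10.0] * 12], columns=months)
sample.insert(0, 'RegionName', ['A', 'B'])
result = add_quarter_means(sample, 2000, 2000)
print(result[['2000q1', '2000q2', '2000q3', '2000q4']])
```

Building all quarter columns in one step and attaching them with a single `pd.concat` avoids inserting columns one at a time. Whether it is actually faster than the loop above on the real data would need to be measured.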