在这个例子中,我们有两天的数据采样,分辨率为1分钟,给我们2880次测量。测量是在多个时区顺序收集的:欧洲/伦敦的前240分钟和美国/洛杉矶的剩余2640个测量值。
import pandas as pd
import numpy as np
df=pd.DataFrame(index=pd.DatetimeIndex(pd.date_range('2015-03-29 00:00','2015-03-30 23:59',freq='1min',tz='UTC')))
df.loc['2015-03-29 00:00':'2015-03-29 04:00','timezone']='Europe/London'
df.loc['2015-03-29 04:00':'2015-03-30 23:59','timezone']='America/Los_Angeles'
df['sales1']=np.random.random_integers(100,size=len(df))
df['sales2']=np.random.random_integers(10,size=len(df))
要计算多天内24小时制的每分钟平均销售额(按UTC时间),以下方法可以很好地运作:
utc_sales=df.groupby([df.index.hour,df.index.minute]).mean()
utc_sales.set_index(pd.date_range("00:00","23:59", freq="1min").time,inplace=True)
此groupby方法也可用于根据其他两个时区之一计算平均销售额,例如“欧洲/伦敦”。
df['London']=df.index.tz_convert('Europe/London')
london_sales=df.groupby([df['London'].dt.hour,df['London'].dt.minute]).mean()
london_sales.set_index(pd.date_range("00:00","23:59", freq="1min").time,inplace=True)
然而,我很难想出一种有效的方法来计算每分钟的平均销售额 - 按当地时间计算 - 在24小时内。我从上面尝试了相同的方法,但是当同一系列中存在多个时区时,groupby将恢复为utc中的索引。
def calculate_localtime(x):
return pd.to_datetime(x.name,unit='s').tz_convert(x['timezone'])
df['localtime']=df.apply(calculate_localtime,axis=1)
local_sales=df.groupby([df['localtime'].dt.hour,df['localtime'].dt.minute]).mean()
local_sales.set_index(pd.date_range("00:00","23:59",freq="1min").time,inplace=True)
我们可以验证local_sales是否与utc_sales相同,因此这种方法不起作用。
In [8]: np.unique(local_sales == utc_sales)
Out[8]: array([ True], dtype=bool)
有人可以推荐适合大型数据集和多个时区的方法吗?
答案 0 :(得分:2)
这是一种获得我认为你想要的方法。这需要pandas 0.17.0
创建数据
import pandas as pd
import numpy as np
pd.options.display.max_rows=12
np.random.seed(1234)
df=pd.DataFrame(index=pd.DatetimeIndex(pd.date_range('2015-03-29 00:00','2015-03-30 23:59',freq='1min',tz='UTC')))
df.loc['2015-03-29 00:00':'2015-03-29 04:00','timezone']='Europe/London'
df.loc['2015-03-29 04:00':'2015-03-30 23:59','timezone']='America/Los_Angeles'
df['sales1']=np.random.random_integers(100,size=len(df))
df['sales2']=np.random.random_integers(10,size=len(df))
In [79]: df
Out[79]:
timezone sales1 sales2
2015-03-29 00:00:00+00:00 Europe/London 48 6
2015-03-29 00:01:00+00:00 Europe/London 84 1
2015-03-29 00:02:00+00:00 Europe/London 39 1
2015-03-29 00:03:00+00:00 Europe/London 54 10
2015-03-29 00:04:00+00:00 Europe/London 77 5
2015-03-29 00:05:00+00:00 Europe/London 25 9
... ... ... ...
2015-03-30 23:54:00+00:00 America/Los_Angeles 77 8
2015-03-30 23:55:00+00:00 America/Los_Angeles 16 4
2015-03-30 23:56:00+00:00 America/Los_Angeles 55 3
2015-03-30 23:57:00+00:00 America/Los_Angeles 18 1
2015-03-30 23:58:00+00:00 America/Los_Angeles 3 2
2015-03-30 23:59:00+00:00 America/Los_Angeles 52 2
[2880 rows x 3 columns]
根据时区进行透视;这会创建一个时区分开的多索引
x = pd.pivot_table(df.reset_index(),values=['sales1','sales2'],index='index',columns='timezone').swaplevel(0,1,axis=1)
x.columns.names = ['timezone','sales']
In [82]: x
Out[82]:
timezone America/Los_Angeles Europe/London America/Los_Angeles Europe/London
sales sales1 sales1 sales2 sales2
index
2015-03-29 00:00:00+00:00 NaN 48 NaN 6
2015-03-29 00:01:00+00:00 NaN 84 NaN 1
2015-03-29 00:02:00+00:00 NaN 39 NaN 1
2015-03-29 00:03:00+00:00 NaN 54 NaN 10
2015-03-29 00:04:00+00:00 NaN 77 NaN 5
2015-03-29 00:05:00+00:00 NaN 25 NaN 9
... ... ... ... ...
2015-03-30 23:54:00+00:00 77 NaN 8 NaN
2015-03-30 23:55:00+00:00 16 NaN 4 NaN
2015-03-30 23:56:00+00:00 55 NaN 3 NaN
2015-03-30 23:57:00+00:00 18 NaN 1 NaN
2015-03-30 23:58:00+00:00 3 NaN 2 NaN
2015-03-30 23:59:00+00:00 52 NaN 2 NaN
[2880 rows x 4 columns]
创建我们想要使用的石斑鱼,即本地区域的小时和分钟。我们将根据面具填充它们,IOW。如果sales1 / sales2都不是空的,我们将使用该(本地)区域的小时/分钟
hours = pd.Series(index=x.index)
minutes = pd.Series(index=x.index)
for tz in ['America/Los_Angeles', 'Europe/London' ]:
local = df.index.tz_convert(tz)
x[(tz,'tz')] = local
mask = x[(tz,'sales1')].notnull() & x[(tz,'sales2')].notnull()
hours.iloc[mask.values] = local.hour[mask.values]
minutes.iloc[mask.values] = local.minute[mask.values]
x = x.sortlevel(axis=1)
以上之后。 (注意这可能有点简化,这意味着我们不需要实际记录本地时区,只计算小时/分钟)。
Out[84]:
timezone America/Los_Angeles Europe/London
sales sales1 sales2 tz sales1 sales2 tz
index
2015-03-29 00:00:00+00:00 NaN NaN 2015-03-28 17:00:00-07:00 48 6 2015-03-29 00:00:00+00:00
2015-03-29 00:01:00+00:00 NaN NaN 2015-03-28 17:01:00-07:00 84 1 2015-03-29 00:01:00+00:00
2015-03-29 00:02:00+00:00 NaN NaN 2015-03-28 17:02:00-07:00 39 1 2015-03-29 00:02:00+00:00
2015-03-29 00:03:00+00:00 NaN NaN 2015-03-28 17:03:00-07:00 54 10 2015-03-29 00:03:00+00:00
2015-03-29 00:04:00+00:00 NaN NaN 2015-03-28 17:04:00-07:00 77 5 2015-03-29 00:04:00+00:00
2015-03-29 00:05:00+00:00 NaN NaN 2015-03-28 17:05:00-07:00 25 9 2015-03-29 00:05:00+00:00
... ... ... ... ... ... ...
2015-03-30 23:54:00+00:00 77 8 2015-03-30 16:54:00-07:00 NaN NaN 2015-03-31 00:54:00+01:00
2015-03-30 23:55:00+00:00 16 4 2015-03-30 16:55:00-07:00 NaN NaN 2015-03-31 00:55:00+01:00
2015-03-30 23:56:00+00:00 55 3 2015-03-30 16:56:00-07:00 NaN NaN 2015-03-31 00:56:00+01:00
2015-03-30 23:57:00+00:00 18 1 2015-03-30 16:57:00-07:00 NaN NaN 2015-03-31 00:57:00+01:00
2015-03-30 23:58:00+00:00 3 2 2015-03-30 16:58:00-07:00 NaN NaN 2015-03-31 00:58:00+01:00
2015-03-30 23:59:00+00:00 52 2 2015-03-30 16:59:00-07:00 NaN NaN 2015-03-31 00:59:00+01:00
[2880 rows x 6 columns]
这使用时区的新表示(在0.17.0中)。
In [85]: x.dtypes
Out[85]:
timezone sales
America/Los_Angeles sales1 float64
sales2 float64
tz datetime64[ns, America/Los_Angeles]
Europe/London sales1 float64
sales2 float64
tz datetime64[ns, Europe/London]
dtype: object
结果
x.groupby([hours,minutes]).mean()
timezone America/Los_Angeles Europe/London
sales sales1 sales2 sales1 sales2
0 0 62.5 5.5 48 6
1 52.0 7.0 84 1
2 89.0 3.5 39 1
3 67.5 6.5 54 10
4 41.0 5.5 77 5
5 81.0 5.5 25 9
... ... ... ... ...
23 54 76.5 4.5 NaN NaN
55 37.5 5.0 NaN NaN
56 60.5 8.0 NaN NaN
57 87.5 7.0 NaN NaN
58 77.5 6.0 NaN NaN
59 31.0 5.5 NaN NaN
[1440 rows x 4 columns]