Question

我有一个数据框，其中包含时区列和日期时间列。我想首先将它们转换为UTC以与其他数据连接，然后我将进行一些计算，最终将UTC转换为查看者本地时区。

datetime              time_zone
2016-09-19 01:29:13   America/Bogota 
2016-09-19 02:16:04   America/New_York
2016-09-19 01:57:54   Africa/Cairo

def create_utc(df, column, time_format='%Y-%m-%d %H:%M:%S'):
    timezone = df['TZ']
    df[column + '_utc'] = df[column].dt.tz_localize(timezone).dt.tz_convert('UTC').dt.strftime(time_format)
    df[column + '_utc'].replace('NaT', np.nan, inplace=True)
    df[column + '_utc'] = pd.to_datetime(df[column + '_utc'])
    return df

这是我的有缺陷的尝试。错误是事实是模棱两可的，这是有道理的，因为时区＆＃39;变量指的是一列。如何引用同一行中的值？

编辑：以下是一天数据（394,000行和22个独特时区）下面的答案的一些结果。 Edit2：我添加了一个groupby示例，以防有人想看到结果。到目前为止，这是最快的。

%%timeit

for tz in df['TZ'].unique():
    df.ix[df['TZ'] == tz, 'datetime_utc2'] = df.ix[df['TZ'] == tz, 'datetime'].dt.tz_localize(tz).dt.tz_convert('UTC')
df['datetime_utc2'] = df['datetime_utc2'].dt.tz_localize(None)

1 loops, best of 3: 1.27 s per loop

%%timeit

df['datetime_utc'] = [d['datetime'].tz_localize(d['TZ']).tz_convert('UTC') for i, d in df.iterrows()]
df['datetime_utc'] = df['datetime_utc'].dt.tz_localize(None)

1 loops, best of 3: 50.3 s per loop

df['datetime_utc'] = pd.concat([d['datetime'].dt.tz_localize(tz).dt.tz_convert('UTC') for tz, d in df.groupby('TZ')])



**1 loops, best of 3: 249 ms per loop**

Answer 1

这是一个矢量化方法（它将循环df.time_zone.nunique()次）：

In [2]: t
Out[2]:
             datetime         time_zone
0 2016-09-19 01:29:13    America/Bogota
1 2016-09-19 02:16:04  America/New_York
2 2016-09-19 01:57:54      Africa/Cairo
3 2016-09-19 11:00:00    America/Bogota
4 2016-09-19 12:00:00  America/New_York
5 2016-09-19 13:00:00      Africa/Cairo

In [3]: for tz in t.time_zone.unique():
   ...:         mask = (t.time_zone == tz)
   ...:         t.loc[mask, 'datetime'] = \
   ...:             t.loc[mask, 'datetime'].dt.tz_localize(tz).dt.tz_convert('UTC')
   ...:

In [4]: t
Out[4]:
             datetime         time_zone
0 2016-09-19 06:29:13    America/Bogota
1 2016-09-19 06:16:04  America/New_York
2 2016-09-18 23:57:54      Africa/Cairo
3 2016-09-19 16:00:00    America/Bogota
4 2016-09-19 16:00:00  America/New_York
5 2016-09-19 11:00:00      Africa/Cairo

<强>更新

In [12]: df['new'] = df.groupby('time_zone')['datetime'] \
                       .transform(lambda x: x.dt.tz_localize(x.name))

In [13]: df
Out[13]:
             datetime         time_zone                 new
0 2016-09-19 01:29:13    America/Bogota 2016-09-19 06:29:13
1 2016-09-19 02:16:04  America/New_York 2016-09-19 06:16:04
2 2016-09-19 01:57:54      Africa/Cairo 2016-09-18 23:57:54
3 2016-09-19 11:00:00    America/Bogota 2016-09-19 16:00:00
4 2016-09-19 12:00:00  America/New_York 2016-09-19 16:00:00
5 2016-09-19 13:00:00      Africa/Cairo 2016-09-19 11:00:00

Answer 2

您的问题是tz_localize()只能采用标量值，因此我们必须遍历DataFrame：

df['datetime_utc'] = [d['datetime'].tz_localize(d['time_zone']).tz_convert('UTC') for i,d in df.iterrows()]

结果是：

            datetime         time_zone              datetime_utc
0 2016-09-19 01:29:13    America/Bogota 2016-09-19 06:29:13+00:00
1 2016-09-19 02:16:04  America/New_York 2016-09-19 06:16:04+00:00
2 2016-09-19 01:57:54      Africa/Cairo 2016-09-18 23:57:54+00:00

另一种方法是按时区分组并在一次传递中转换所有匹配的行：

df['datetime_utc'] = pd.concat([d['datetime'].dt.tz_localize(tz).dt.tz_convert('UTC') for tz, d in df.groupby('time_zone')])

Pandas将日期时间转换为单独的时区列

2 个答案: