Question

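I have code that converts Europe/Brussels time to UTC. Does this code handle the CET and CEST transitions, i.e. does it take daylight saving time into account when converting to UTC? If not, can someone suggest how to handle it?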

df['datetime'] = pd.to_datetime(df['date'] + " " + df['time']).dt.tz_localize('Europe/Brussels').\
     dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S')

The data below is in Dutch local time, which is why it is being converted to UTC.

1/17/2018   1   0:00
1/17/2018   2   0:01
1/17/2018   3   0:02
1/17/2018   4   0:03
1/17/2018   5   0:04
1/17/2018   6   0:05
1/17/2018   7   0:06
1/17/2018   8   0:07

Answer 1

Yes, it handles DST on its own. You can check it:

import pandas as pd
df = pd.DataFrame({'date': pd.to_datetime(['2017-08-30 12:00:00', '2017-12-30 12:00:00'])})
df['date'].dt.tz_localize('Europe/Brussels').dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S')

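I picked one date that falls in DST, i.e. UTC+1+1, and another that is only UTC+1 (where the +1 is the Brussels offset). The output shows that 2 hours are subtracted from the first date and 1 hour from the second.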

0    2017-08-30 10:00:00
1    2017-12-30 11:00:00
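
To make the offsets themselves visible, here is a minimal sketch that reuses the same two timestamps and reads the UTC offset pandas attached to each one. The variable names are illustrative and not part of the original answer.

```python
import pandas as pd

# Two naive local timestamps: one in summer (CEST), one in winter (CET).
local = pd.Series(pd.to_datetime(["2017-08-30 12:00:00", "2017-12-30 12:00:00"]))
localized = local.dt.tz_localize("Europe/Brussels")

# UTC offset, in hours, that was applied to each timestamp.
offset_hours = [ts.utcoffset().total_seconds() / 3600 for ts in localized]
print(offset_hours)  # [2.0, 1.0]
```

The summer date carries a +2 hour offset and the winter date +1 hour, which matches the converted output above.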

Answer 2

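Okay, sorry if I use this as a kind of gist for my own future reference :). While @Michal Ficek's answer is technically correct, it usually does not work for me with the data files I come across in real life. When I get a time series file with a local time column, like yours, 90% of the time I get an exception. So what I would check are the transitions between winter time and summer time.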

Ideally (at least in the absence of explicit offset information) you would want to see something like this:

#test_good.csv
local_time,value
...
2017-03-26 00:00,2016
2017-03-26 01:00,2017
2017-03-26 03:00,2018
2017-03-26 04:00,2019
...
2017-10-29 01:00,7224
2017-10-29 02:00,7225
2017-10-29 02:00,7226
2017-10-29 03:00,7227
...

But most of the time you will see this:

# test_bad.csv
local_time,value
...
2017-03-26 00:00,2016
2017-03-26 01:00,2017
2017-03-26 02:00,2018   # should not exist, so people made up number?
2017-03-26 03:00,2018
2017-03-26 04:00,2019
...
2017-10-29 00:00,7223
2017-10-29 01:00,7224   # so here is a value missing now
2017-10-29 02:00,7226
2017-10-29 03:00,7227
...

So if you use your line on test_good.csv you will get an AmbiguousTimeError, but that is easily handled with the ambiguous="infer" flag:

df_good['utc_time'] = pd.to_datetime(df_good["local_time"]).dt.tz_localize('CET', ambiguous="infer").dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S')

Then everything is fine.

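Not so for test_bad.csv: without the flag you get a NonExistentTimeError, because there is a timestamp that should not exist. So you try ambiguous="infer" and get an AmbiguousTimeError, because it does not know how to handle the non-repeated times. That can be fixed with ambiguous="NaT", which in turn throws NonExistentTimeError again. Yep, full circle.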
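
One way out of that circle, sketched here under the assumption that the installed pandas is 0.24 or newer: tz_localize also accepts a nonexistent argument, so both kinds of problem rows can be turned into NaT and inspected instead of raising. The sample rows imitate test_bad.csv and the variable names are made up for illustration.

```python
import pandas as pd

# Rows shaped like test_bad.csv: a made-up 02:00 in March
# and only a single 02:00 in October.
bad_local = pd.Series(pd.to_datetime([
    "2017-03-26 01:00", "2017-03-26 02:00", "2017-03-26 03:00",
    "2017-10-29 01:00", "2017-10-29 02:00", "2017-10-29 03:00",
]))

# Mark nonexistent and ambiguous local times as NaT instead of raising.
utc = (bad_local
       .dt.tz_localize("CET", ambiguous="NaT", nonexistent="NaT")
       .dt.tz_convert("UTC"))

# The rows that could not be converted unambiguously.
problem_rows = bad_local[utc.isna()]
print(problem_rows)
```

This only locates the offending rows (here the March 02:00 and the October 02:00); what to do with them is still a decision about the data.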

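So far I have fixed this by hand a few times (always googling the summer time transition dates for the country in question). So this morning I used your question to come up with this (admittedly hacky) function: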

# needs pandas as pd, numpy as np, re and the two pytz exceptions (imports are in the complete example)
def add_utc_from_localtime(df, local_time_column='local_time', values=['value']):
    try: # here everything is as expected
        df['utc_time'] = (pd.to_datetime(df[local_time_column])
                            .dt.tz_localize('CET', ambiguous="infer")
                            .dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S'))
    except AmbiguousTimeError as e: # okay, so it expects one line to be there twice
        d = re.findall(r'from (.+?) as', str(e))[0] # get the date from error message
        df.loc[df.index[-1] + 1,:] = [d, np.nan] # add a line with this date at the end (assumes two columns)
        df = df.sort_values(local_time_column) # sort according to date column
        df[values] = df[values].interpolate() # make up some new value by interpolating
        try:
            df['utc_time'] = (pd.to_datetime(df[local_time_column])
                                .dt.tz_localize('CET', ambiguous="infer")
                                .dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S'))
        except NonExistentTimeError as e2: # okay, now the problem is a timestamp that should not exist
            df = df.drop(df[df[local_time_column] == str(e2)].index) # drop it based on error message
            df['utc_time'] = (pd.to_datetime(df[local_time_column])
                                .dt.tz_localize('CET', ambiguous="infer")
                                .dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S'))
    return df

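Of course this may break with a pandas update, since it relies on the format of the error messages. But it still beats going through years of data by hand.

Here is a complete example with test data: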

import pandas as pd
import numpy as np
from pytz.exceptions import AmbiguousTimeError, NonExistentTimeError
import re

#generate good data
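# pd.DatetimeIndex(start=..., end=...) was removed in pandas 1.0; pd.date_range builds the same hourly index
idx = pd.date_range(start="2017-01-01", end="2017-12-31 23:00", freq="h", tz="CET")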
df = pd.DataFrame(data=np.arange(0.0,float(len(idx))),index=idx)
df.to_csv("test_good.csv",date_format="%Y-%m-%d %H:%M:%S",header=["value"],index_label="local_time")

df_good = pd.read_csv("test_good.csv", header=0)
# results in AmbiguousTimeError
#df_good['utc_time'] = pd.to_datetime(df_good["local_time"]).dt.tz_localize('CET').dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S')
# works
df_good['utc_time'] = pd.to_datetime(df_good["local_time"]).dt.tz_localize('CET', ambiguous="infer").dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S')
# works
df_good = add_utc_from_localtime(df_good)

#generate bad handled data
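# pd.DatetimeIndex(start=..., end=...) was removed in pandas 1.0; pd.date_range builds the same hourly index
idx = pd.date_range(start="2017-01-01", end="2017-12-31 23:00", freq="h")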
df = pd.DataFrame(data=np.arange(0.0,float(len(idx))),index=idx)
df["2017-03-26 03:00":"2017-10-29 01:00"] -= 1 # simulate bad handling
df.to_csv("test_bad.csv",date_format="%Y-%m-%d %H:%M:%S",header=["value"],index_label="local_time")

df_bad = pd.read_csv("test_bad.csv", header=0)
# results in NonExistentTimeError
#df_bad['utc_time'] = pd.to_datetime(df_bad["local_time"]).dt.tz_localize('CET').dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S')
# results in NonExistentTimeError
#df_bad['utc_time'] = pd.to_datetime(df_bad["local_time"]).dt.tz_localize('CET', ambiguous="infer").dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S')
# works
df_bad = add_utc_from_localtime(df_bad)

Of course, if I have missed some other, more elegant way, I would be happy to learn about it (maybe I will ask a separate question).
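
One possibly more elegant route, sketched under the assumption of pandas 0.24 or newer, is to state a policy up front instead of parsing error messages: pass a boolean array as ambiguous (True means "treat as DST") and let nonexistent="NaT" flag the made-up spring rows so they can be dropped. The helper name local_to_utc_lenient and the choice to treat a lone October 02:00 as summer time are assumptions for illustration, and unlike the function above this sketch does not interpolate a missing row.

```python
import numpy as np
import pandas as pd

def local_to_utc_lenient(local_times, tz="CET"):
    """Hypothetical helper: convert naive local times to UTC without raising."""
    naive = pd.Series(pd.to_datetime(local_times))
    # True = treat as DST; the flag is only consulted for ambiguous times.
    assume_dst = np.ones(len(naive), dtype=bool)
    localized = naive.dt.tz_localize(tz, ambiguous=assume_dst, nonexistent="NaT")
    return localized.dt.tz_convert("UTC")

bad_local = pd.Series([
    "2017-03-26 01:00", "2017-03-26 02:00", "2017-03-26 03:00",
    "2017-10-29 01:00", "2017-10-29 02:00", "2017-10-29 03:00",
])
utc = local_to_utc_lenient(bad_local)

# Drop the rows whose local time never existed (the made-up March 02:00).
cleaned = (pd.DataFrame({"local_time": bad_local, "utc_time": utc})
           .dropna(subset=["utc_time"]))
print(cleaned)
```

Whether a single October 02:00 reading belongs to summer or winter time cannot be recovered from the file, so the assume_dst choice should be checked against how the data source actually logs.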

Answer 3

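I ran into the same problem with a real-world dataset as described in Marcus V's answer. Mine had a value at 3 am in March, which raised InconsistentTimeError, and only one value at 2 am in October, which raised AmbiguousTimeError: Cannot infer dst time from %r, try using the 'ambiguous' argument without ambiguous='infer', and ValueError: Cannot infer offset with only one time. with it.

Here is what I came up with to solve the problem for this kind of dataset, in case it helps anyone: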

def cet_to_utc(df, col_name):
    # Convert dataframe CET/CEST datetimes column to UTC datetimes
    # Example call: cet_to_utc(dataframe, 'Datetime')
    #
    # --- Arguments description --
    # You need to provide as first argument the dataframe you want to modify,
    # and as second argument the column you want to modify.
    idx_name = df.index.name
    df = df.reset_index()
    idx = 0
    while idx != df.index[-1] + 1:
        try:
            df.loc[idx, 'temp'] = pd.to_datetime(df.loc[idx, col_name]).tz_localize('CET').tz_convert('UTC')
            idx += 1
        except Exception:
            # AmbiguousTimeError
            if pd.to_datetime(df.loc[idx, col_name]).month == 10:
                # Duplicate the single value we had at 2 am
                # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
                df = pd.concat([df.iloc[:idx], df.iloc[[idx]], df.iloc[idx:]]).reset_index(drop=True)
                # Convert both rows to UTC
                df.loc[idx, 'temp'] = (
                    pd.to_datetime(df.loc[idx, col_name]) - pd.Timedelta(hours=2)).tz_localize('UTC')
                df.loc[idx + 1, 'temp'] = (
                    pd.to_datetime(df.loc[idx, col_name]) - pd.Timedelta(hours=1)).tz_localize('UTC')
                idx += 2

            # InconsistentTimeError
            else:
                # Delete the 3 am row
                df.drop(idx, inplace=True)
                df = df.sort_index().reset_index(drop=True)

    df[col_name] = df['temp']
    df = df.drop(labels='temp', axis=1)
    if idx_name:
        df = df.set_index(idx_name)
        df.index.name = idx_name
    else:
        df = df.set_index('index')
        df.index.name = None
    return df

Converting CET and CEST to UTC
