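I have code that converts Europe/Brussels time to UTC. Does this code handle the CET and CEST transitions? That is, does it account for daylight saving time when converting to UTC? If not, can someone suggest how to handle it?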
df['datetime'] = pd.to_datetime(df['date'] + " " + df['time']).dt.tz_localize('Europe/Brussels').\
dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S')
The following data is in Dutch local time, so it is converted to UTC.
1/17/2018 1 0:00
1/17/2018 2 0:01
1/17/2018 3 0:02
1/17/2018 4 0:03
1/17/2018 5 0:04
1/17/2018 6 0:05
1/17/2018 7 0:06
1/17/2018 8 0:07
Answer 0 (score: 3)
Yes, it handles DST out of the box. Check it:
import pandas as pd
df = pd.DataFrame({'date': pd.to_datetime(['2017-08-30 12:00:00', '2017-12-30 12:00:00'])})
df['date'].dt.tz_localize('Europe/Brussels').dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S')
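I picked one date that falls in DST, i.e. UTC+1+1, and another with only UTC+1 (where the +1 is the standard offset for Brussels). The output shows that the first date was converted by subtracting 2 hours, while the second had 1 hour subtracted.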
0 2017-08-30 10:00:00
1 2017-12-30 11:00:00
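If you would rather see the offsets than infer them from the shifted clock times, something like the following should work. It is only a minimal sketch reusing the two dates from above; the names `local`, `aware`, `offsets_hours` and `utc` are made up for illustration.

```python
import pandas as pd

# Two naive local timestamps: one in summer (CEST), one in winter (CET).
local = pd.Series(pd.to_datetime(["2017-08-30 12:00:00", "2017-12-30 12:00:00"]))

# Attach the Europe/Brussels zone; pandas picks the offset per timestamp.
aware = local.dt.tz_localize("Europe/Brussels")

# UTC offset of each timestamp, in hours.
offsets_hours = [ts.utcoffset().total_seconds() / 3600 for ts in aware]

# The same instants expressed in UTC.
utc = aware.dt.tz_convert("UTC")

print(offsets_hours)
print(utc.dt.strftime("%Y-%m-%d %H:%M:%S").tolist())
```

The summer date should report an offset of 2 hours and the winter date 1 hour, which matches the output shown above.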
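Answer 1 (score: 2)
Okay, sorry if I am using this as a kind of gist for my own future reference :). But while @Michal Ficek's answer is technically correct, it usually does not work for me with the data files I encounter in real life. When I get a time series file with a local time column, like yours, 90% of the time I get an exception. So I would check what happens around the transitions to and from summer time.
Ideally (at least in the absence of explicit offset information) you would want to see something like this: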
#test_good.csv
local_time,value
...
2017-03-26 00:00,2016
2017-03-26 01:00,2017
2017-03-26 03:00,2018
2017-03-26 04:00,2019
...
2017-10-29 01:00,7224
2017-10-29 02:00,7225
2017-10-29 02:00,7226
2017-10-29 03:00,7227
...
But most of the time you see this:
# test_bad.csv
local_time,value
...
2017-03-26 00:00,2016
2017-03-26 01:00,2017
2017-03-26 02:00,2018 # should not exist, so people made up number?
2017-03-26 03:00,2018
2017-03-26 04:00,2019
...
2017-10-29 00:00,7223
2017-10-29 01:00,7224 # so here is a value missing now
2017-10-29 02:00,7226
2017-10-29 03:00,7227
...
So if you use your line on test_good.csv, you get an AmbiguousTimeError, but that is easily handled with the ambiguous="infer" flag:
df_good['utc_time'] = pd.to_datetime(df_good["local_time"]).dt.tz_localize('CET', ambiguous="infer").dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S')
Then everything is fine.
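To make it concrete what ambiguous="infer" does with the repeated hour, here is a small sketch on just the four October rows from test_good.csv. The variable names are illustrative only.

```python
import pandas as pd

# The autumn transition as it appears in test_good.csv: 02:00 occurs twice.
local = pd.Series(pd.to_datetime([
    "2017-10-29 01:00",
    "2017-10-29 02:00",  # first 02:00, still summer time (CEST)
    "2017-10-29 02:00",  # second 02:00, already winter time (CET)
    "2017-10-29 03:00",
]))

# "infer" uses the order of the rows to decide which 02:00 is which.
utc = local.dt.tz_localize("CET", ambiguous="infer").dt.tz_convert("UTC")

print(utc.dt.strftime("%Y-%m-%d %H:%M:%S").tolist())
```

The two identical local timestamps end up one hour apart in UTC, so the UTC series is strictly increasing again.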
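But not for test_bad.csv: without the flag you get a NonExistentTimeError, because there is a timestamp that should not exist. So you try ambiguous="infer" and get an AmbiguousTimeError, because it does not know how to handle the non-repeated time. That can be fixed with ambiguous="NaT", which again throws NonExistentTimeError. Yep, full circle.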
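One possible way out of that circle, assuming a pandas version whose tz_localize accepts the nonexistent argument (it was added in pandas 0.24), is to let both kinds of problem rows become NaT and deal with them afterwards. This is a sketch on a few rows shaped like test_bad.csv, not a drop-in replacement for anything above.

```python
import pandas as pd

# Rows shaped like test_bad.csv: a 02:00 in March that cannot exist,
# and a single (non-repeated) 02:00 in October.
local = pd.Series(pd.to_datetime([
    "2017-03-26 01:00",
    "2017-03-26 02:00",  # falls into the spring-forward gap
    "2017-03-26 03:00",
    "2017-10-29 02:00",  # ambiguous, and not repeated in the data
]))

# Mark both problem cases as NaT instead of raising.
aware = local.dt.tz_localize("CET", ambiguous="NaT", nonexistent="NaT")
utc = aware.dt.tz_convert("UTC")

print(utc)
print(utc.isna().tolist())
```

The rows flagged by isna() are then the ones to inspect, drop or fill by hand; whether that is acceptable depends on the data.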
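So far I have fixed this manually a couple of times (always after Googling the summer time transition dates of the respective country). So this morning I used your question to come up with this (albeit hacky) function: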
def add_utc_from_localtime(df, local_time_column='local_time', values=['value']):
    try:  # here everything is as expected
        df['utc_time'] = (pd.to_datetime(df[local_time_column])
                          .dt.tz_localize('CET', ambiguous="infer")
                          .dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S'))
    except AmbiguousTimeError as e:  # okay, so it expects one line to be there twice
        d = re.findall(r'from (.+?) as', str(e))[0]  # get the date from the error message
        df.loc[df.index[-1] + 1, :] = [d, np.nan]  # add a line with this date at the end
        df = df.sort_values(local_time_column)  # sort according to date column
        df[values] = df[values].interpolate()  # make up some new value by interpolating
        try:
            df['utc_time'] = (pd.to_datetime(df[local_time_column])
                              .dt.tz_localize('CET', ambiguous="infer")
                              .dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S'))
        except NonExistentTimeError as e2:  # okay, now the problem is a timestamp that should not exist
            df = df.drop(df[df[local_time_column] == str(e2)].index)  # drop it based on the error message
            df['utc_time'] = (pd.to_datetime(df[local_time_column])
                              .dt.tz_localize('CET', ambiguous="infer")
                              .dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S'))
    return df
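Of course this may break with a pandas update, since it relies on the error message format. But it is still better than manually going through years of data.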
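As for looking up the transition dates: instead of Googling them, they can probably be read from the time zone database. The helper below is a hypothetical sketch (the name `dst_transitions` is mine, it is not part of pandas or the standard library) that uses the standard-library zoneinfo module, so it assumes Python 3.9+ and an installed tz database. It simply steps through a year hour by hour in UTC and records where the UTC offset changes.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def dst_transitions(year, tz_name):
    """Return the UTC instants in `year` at which the zone's UTC offset changes."""
    tz = ZoneInfo(tz_name)
    t = datetime(year, 1, 1, tzinfo=timezone.utc)
    end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    step = timedelta(hours=1)
    transitions = []
    prev = t.astimezone(tz).utcoffset()
    while t < end:
        t += step
        cur = t.astimezone(tz).utcoffset()
        if cur != prev:  # the offset changed between the previous hour and this one
            transitions.append(t)
            prev = cur
    return transitions


for instant in dst_transitions(2017, "Europe/Brussels"):
    print(instant, "->", instant.astimezone(ZoneInfo("Europe/Brussels")))
```

The hourly step is a simplification; it is fine for zones that switch on the hour, which is the assumption here.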
Here is a full example including test data:
import pandas as pd
import numpy as np
from pytz.exceptions import AmbiguousTimeError, NonExistentTimeError
import re
#generate good data
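# pd.DatetimeIndex(start=...) was removed in pandas 1.0; on pandas < 1.4 pass closed="left" instead of inclusive="left"
idx = pd.date_range(start="1.1.2017", end="01.01.2018", freq="h", inclusive="left", tz="CET")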
df = pd.DataFrame(data=np.arange(0.0,float(len(idx))),index=idx)
df.to_csv("test_good.csv",date_format="%Y-%m-%d %H:%M:%S",header=["value"],index_label="local_time")
df_good = pd.read_csv("test_good.csv", header=0)
# results in AmbiguousTimeError
#df_good['utc_time'] = pd.to_datetime(df_good["local_time"]).dt.tz_localize('CET').dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S')
# works
df_good['utc_time'] = pd.to_datetime(df_good["local_time"]).dt.tz_localize('CET', ambiguous="infer").dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S')
# works
df_good = add_utc_from_localtime(df_good)
#generate bad handled data
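# same as above but without a time zone, so 02:00 on 2017-03-26 exists and 02:00 on 2017-10-29 is not repeated
idx = pd.date_range(start="1.1.2017", end="01.01.2018", freq="h", inclusive="left")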
df = pd.DataFrame(data=np.arange(0.0,float(len(idx))),index=idx)
df["2017-03-26 03:00":"2017-10-29 01:00"] -= 1 # simulate bad handling
df.to_csv("test_bad.csv",date_format="%Y-%m-%d %H:%M:%S",header=["value"],index_label="local_time")
df_bad = pd.read_csv("test_bad.csv", header=0)
# results in NonExistentTimeError
#df_bad['utc_time'] = pd.to_datetime(df_bad["local_time"]).dt.tz_localize('CET').dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S')
# results in NonExistentTimeError
#df_bad['utc_time'] = pd.to_datetime(df_bad["local_time"]).dt.tz_localize('CET', ambiguous="infer").dt.tz_convert('UTC').dt.strftime('%Y-%m-%d %H:%M:%S')
# works
df_bad = add_utc_from_localtime(df_bad)
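Of course, if I have missed some other, more elegant way, I am happy to learn about it (maybe I will post this as a separate question).
Answer 2 (score: 0)
I faced the same issue as described in Marcus V's answer with real-life datasets. Mine had a value at 3 am in March, which raised InconsistentTimeError, and only a single value at 2 am in October, which raised AmbiguousTimeError: Cannot infer dst time from %r, try using the 'ambiguous' argument without ambiguous='infer', and ValueError: Cannot infer offset with only one time. with it.
Here is what I came up with to handle this kind of dataset, in case it helps anyone: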
def cet_to_utc(df, col_name):
    # Convert dataframe CET/CEST datetimes column to UTC datetimes
    # Example call: cet_to_utc(dataframe, 'Datetime')
    #
    # --- Arguments description --
    # You need to provide as first argument the dataframe you want to modify,
    # and as second argument the column you want to modify.
    # The column is expected to already hold datetimes (not strings).
    idx_name = df.index.name
    df = df.reset_index()
    idx = 0
    while idx != df.index[-1] + 1:
        try:
            df.loc[idx, 'temp'] = pd.to_datetime(df.loc[idx, col_name]).tz_localize('CET').tz_convert('UTC')
            idx += 1
        except Exception:
            # AmbiguousTimeError
            if df.loc[idx, col_name].month == 10:
                # Duplicate the single value we had at 2 am
                # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
                df = pd.concat([df.iloc[:idx], df.iloc[[idx]], df.iloc[idx:]]).reset_index(drop=True)
                # Convert both rows to UTC
                df.loc[idx, 'temp'] = pd.to_datetime(
                    pd.to_datetime(df.loc[idx, col_name]) - pd.Timedelta(hours=2)).tz_localize('UTC')
                df.loc[idx + 1, 'temp'] = pd.to_datetime(
                    pd.to_datetime(df.loc[idx, col_name]) - pd.Timedelta(hours=1)).tz_localize('UTC')
                idx += 2
            # InconsistentTimeError
            else:
                # Delete the 3 am row
                df.drop(idx, inplace=True)
                df = df.sort_index().reset_index(drop=True)
    df[col_name] = df['temp']
    df = df.drop(labels='temp', axis=1)
    if idx_name:
        df = df.set_index(idx_name)
        df.index.name = idx_name
    else:
        df = df.set_index('index')
        df.index.name = None
    return df
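For comparison, a vectorized variant of the same idea might look like the sketch below. It is not equivalent to the function above: it does not duplicate the October row, it just decides that a lone 2 am value is summer time, and it shifts a non-existent March value forward instead of deleting it. It assumes a pandas version that supports the nonexistent argument of tz_localize (0.24 or later); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

local = pd.Series(pd.to_datetime([
    "2017-03-26 01:00",
    "2017-03-26 02:00",  # does not exist in CET/CEST
    "2017-10-29 01:00",
    "2017-10-29 02:00",  # ambiguous and not repeated
    "2017-10-29 03:00",
]))

# True means "treat as DST"; the flag is only consulted for ambiguous times.
is_dst = np.ones(len(local), dtype=bool)

utc = (local
       .dt.tz_localize("CET", ambiguous=is_dst, nonexistent="shift_forward")
       .dt.tz_convert("UTC"))

print(utc.dt.strftime("%Y-%m-%d %H:%M:%S").tolist())
```

Whether treating the single 2 am value as summer time is right depends on how the data was recorded, so take this as one option rather than the fix.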