I suspect this may be the result of a latent bug, but I have been unable to track down the root cause.
The environment is:
- Python 2.7
- pandas 0.19.1
Reading/processing/writing CSV data with pandas produces rare corruption in the output.
csv_data = """
timestamp,outdoor temperature
2019-01-10 07:16:38.758659,17.5
2019-01-10 07:31:51.449437,16.9
2019-01-10 07:47:04.458140,17.5
2019-01-10 08:02:17.372576,17.8
2019-01-10 08:17:30.156140,18.3
2019-01-10 08:32:42.878982,19.2
2019-01-10 08:47:55.782450,19.9
2019-01-10 09:03:08.907534,21.0
2019-01-10 09:18:21.599587,21.3
2019-01-10 09:33:34.572015,21.8
2019-01-10 09:48:47.524057,22.5
2019-01-10 10:04:00.420671,23.3
2019-01-10 10:19:13.187784,24.2
2019-01-10 10:34:26.118712,24.2
2019-01-10 10:49:39.000694,24.5
2019-01-10 11:04:51.870451,25.6
2019-01-10 11:20:04.763880,26.0
2019-01-10 11:35:17.541427,26.4
2019-01-10 11:50:30.252781,27.1
"""
The core of the I/O and processing code (it contains a line or two that I believe are unrelated to the problem, but I include them for completeness):
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import datetime as dt
import pandas as pd
# Load the entire CSV
df = pd.read_csv(full_path, delimiter=',', encoding='utf-8')
# Coerce column 0 name
df.columns = [c.lower() if c == 'Timestamp' else c for c in df.columns]
column_names = list(df)
# Convert column 0 to datetime
df[column_names[0]] = pd.to_datetime(df[column_names[0]], errors='coerce', format="%Y-%m-%d %H:%M:%S.%f").astype(dt.datetime)
# Drop obs older than delta hours
cut_off = dt.datetime.now() - dt.timedelta(hours=72)
df = df[df[column_names[0]] >= cut_off]
# Add a new observation to the end of the dataframe
df = df.append({column_names[0]: dt.datetime.now(), column_names[1]: '12.3'}, ignore_index=True)
# Keep max 300 obs
df = df.tail(300)
# Replace the CSV with the revised dataframe
df.to_csv(full_path, sep=',', encoding='utf-8', index=False)
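For anyone who wants to run the snippet above standalone, here is an equivalent self-contained sketch (my assumptions: an in-memory buffer stands in for full_path, which is defined elsewhere; pd.concat stands in for append so it also runs on current pandas; and the 72-hour cut-off and tail(300) steps are omitted so the 2019 sample rows survive):

```python
import datetime as dt
import io
import pandas as pd

# The same CSV sample as above, truncated to two rows.
csv_data = u"""timestamp,outdoor temperature
2019-01-10 07:16:38.758659,17.5
2019-01-10 07:31:51.449437,16.9
"""

df = pd.read_csv(io.StringIO(csv_data), delimiter=',')
column_names = list(df)

# Convert column 0 to datetime. The trailing .astype(dt.datetime)
# from the original is deliberately omitted here, so the column
# stays at pandas' native datetime64[ns] dtype.
df[column_names[0]] = pd.to_datetime(df[column_names[0]],
                                     errors='coerce',
                                     format="%Y-%m-%d %H:%M:%S.%f")

# Append a new observation (pd.concat replaces the removed append).
new_row = pd.DataFrame([{column_names[0]: dt.datetime.now(),
                         column_names[1]: '12.3'}])
df = pd.concat([df, new_row], ignore_index=True)

# Write the revised frame to an in-memory "file".
out = io.StringIO()
df.to_csv(out, sep=',', index=False)
print(out.getvalue().splitlines()[0])  # header row
```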
Occasionally this produces output like the following (note what appears to be part of a POSIX-epoch-style string appended to some column 1 observations):
timestamp,outdoor temperature
2019-01-10 07:16:38.758659,17.5
2019-01-10 07:31:51.449437,16.9
2019-01-10 07:47:04.458140,17.5
2019-01-10 08:02:17.372576,17.8
2019-01-10 08:17:30.156140,18.3-01-01 00:00:00
2019-01-10 08:32:42.878982,19.2
2019-01-10 08:47:55.782450,19.9
2019-01-10 09:03:08.907534,21.0
2019-01-10 09:18:21.599587,21.3
2019-01-10 09:33:34.572015,21.8
2019-01-10 09:48:47.524057,22.5
2019-01-10 10:04:00.420671,23.3
2019-01-10 10:19:13.187784,24.2-01-01 00:00:00
2019-01-10 10:34:26.118712,24.2
2019-01-10 10:49:39.000694,24.5
2019-01-10 11:04:51.870451,25.6
2019-01-10 11:20:04.763880,26.0
2019-01-10 11:35:17.541427,26.4
2019-01-10 11:50:30.252781,27.1
I have tried .append() and .extend(), as well as various combinations of encoding/decoding, but I find this bug is very inconsistent and hard to reproduce. I initially suspected it might be the result of extended Unicode characters or a race condition, but I believe I have eliminated both possibilities. If my Python code is not the culprit, I am able to patch the pandas library (but cannot update it to a newer version), and I must stay on Python 2.7.
I would like to avoid the brute-force attack of simply iterating over the column 1 observations and stripping out the evil bits. Any suggestions would be greatly appreciated. Thanks in advance.
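If a brute-force cleanup does become necessary, one hypothetical pass (the column name and the corrupt sample below just mirror the question's data) is to keep only the leading numeric part of each column 1 value and discard any appended date fragment:

```python
import io
import pandas as pd

# Corrupt sample mimicking the mangled output above.
corrupt = u"""timestamp,outdoor temperature
2019-01-10 08:17:30.156140,18.3
2019-01-10 08:32:42.878982,19.2-01-01 00:00:00
"""
df = pd.read_csv(io.StringIO(corrupt))

# Extract the leading float from each value, then coerce the
# column back to a numeric dtype; untouched values pass through.
df['outdoor temperature'] = (df['outdoor temperature'].astype(str)
                             .str.extract(r'^(-?\d+(?:\.\d+)?)',
                                          expand=False)
                             .astype(float))
print(df['outdoor temperature'].tolist())  # [18.3, 19.2]
```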
Additional research suggests that .astype() can misbehave from time to time, and that combining it with errors='coerce' may be masking the problem. Would another form of date conversion add any value, for example:
import dateutil.parser
df[column_names[0]] = [dateutil.parser.parse(obs) for obs in df[column_names[0]]]