Question

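I read a CSV file that contains dates. Some of the dates may be malformed, and I want to find them. With the approach below, I would expect the second row to be NaT. But pandas seems to ignore the specified format, no matter whether I set infer_datetime_format or exact.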

import pandas as pd
from io import StringIO

DATA = StringIO("""date
2019 10 07
   2018 10
""")
df = pd.read_csv(DATA)

df['date'] = pd.to_datetime(df['date'], format="%Y %m %d", errors='coerce', exact=True)

produces

        date
0 2019-10-07
1 2018-10-01

The pandas.to_datetime documentation refers to strftime() and strptime() Behavior, but when I test it with plain Python, it works:

import datetime

datetime.datetime.strptime('  2018 10', '%Y %m %d')

I get the expected ValueError:

ValueError: time data '  2018 10' does not match format '%Y %m %d'

What am I missing?

FYI: the question pandas to_datetime not working seems related, but it is different and appears to have been resolved by now. My pandas version is 0.25.2.

Answer 1

This is a known bug; see GitHub for details.

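Since we needed a solution, I came up with the following workaround. Note that in my question I used read_csv to keep the reproducible snippet small and simple. We actually use read_fwf, and here is some example data (time.txt):

2019 10 07 + 14:45 15:00  # Foo
2019 10 07 + 18:00 18:30  # Bar
  2019 10 09 + 13:00 13:45  # Wrong indentation

I felt it was also a good idea to report the row number, so I added a bit more voodoo:

import datetime
import io

import pandas as pd


class FileSanitizer(io.TextIOBase):
    """Wrap a file and validate the date slice of every line as it is read."""

    def __init__(self, iterable, date_range):
        self.iterable = iterable
        self.date_range = date_range  # [start, end] character positions of the date
        self.row = 0

    def readline(self):
        result = next(self.iterable)
        self.row += 1
        try:
            # strptime is strict about the format, unlike to_datetime here
            datetime.datetime.strptime(
                result[self.date_range[0]:self.date_range[1]], "%Y %m %d")
        except ValueError as excep:
            # Prefix the message with the row number and keep the original cause
            raise ValueError(f'row: {self.row} => {excep}') from excep
        return result


filepath = 'time.txt'
colspecs = [[0, 10], [13, 18], [19, 25], [26, None]]
names = ['date', 'start', 'end', 'description']

with open(filepath, 'r') as file:
    df = pd.read_fwf(FileSanitizer(file, colspecs[0]),
                     colspecs=colspecs,
                     names=names)

The solution is based on this answer: How to skip blank lines with read_fwf in pandas?. Note that this does not work with read_csv.

Now I get the following error, as expected:

ValueError: row: 3 => time data '  2019 10 ' does not match format '%Y %m %d'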

If anyone has a more sophisticated answer, I am happy to learn.
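For the read_csv case from the question, one possible alternative is to validate each value with datetime.datetime.strptime and map failures to NaT, rather than relying on to_datetime to enforce the format. The following is only a minimal sketch: the helper name strict_to_datetime is hypothetical (it is not part of pandas), and it assumes the column holds the raw strings exactly as they were read.

```python
import datetime

import pandas as pd


def strict_to_datetime(values, fmt):
    """Hypothetical helper: parse each value strictly with strptime.

    Values that do not match fmt (or are not strings) become NaT.
    """
    def parse(value):
        try:
            return datetime.datetime.strptime(value, fmt)
        except (ValueError, TypeError):
            return pd.NaT

    # Map value by value, then normalize the result to a datetime64 Series
    return pd.to_datetime(pd.Series(values).map(parse))


# The two sample values from the question, leading whitespace included
dates = strict_to_datetime(["2019 10 07", "   2018 10"], "%Y %m %d")
malformed_rows = dates[dates.isna()].index.tolist()
```

Here malformed_rows holds the positions of the entries that failed the strict check, so they can be reported or inspected. Parsing value by value in Python is slower than the vectorized to_datetime, so this is probably only suitable for modest file sizes.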
