Error converting dates in a DataFrame or when reading a CSV file with pandas

Time: 2020-05-22 12:48:39

Tags: python pandas

I need to use pandas to import a CSV file whose dates are in the format 'year.decimal day' (for example '1980.042') and convert them to the format 'DD/MM/YYYY', for example '11/02/1980'.

File sample:

data
1980.042
1980.125
1980.208
1980.292
1980.375
1980.458
1980.542
1980.625
1980.708

Using pd.to_datetime, I can convert a single value like this:

import pandas as pd
d = '1980.042'
print(pd.to_datetime(d, format = '%Y.%j'))

Output:

1980-02-11 00:00:00

My first attempt was to read the file and convert the DataFrame column:

import pandas as pd
df = pd.read_csv('datas.csv')
print(df.dtypes, '\n\n', df.head())
df['data'] = pd.to_datetime(df['data'], '%Y.%j')

Output:

data    float64
dtype: object 

        data
0  1980.042
1  1980.125
2  1980.208
3  1980.292
4  1980.375

Traceback (most recent call last):
  File "datas.py", line 4, in <module>
    df['data'] = pd.to_datetime(df['data'], '%Y.%j')
  File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 451, in to_datetime
    values = _convert_listlike(arg._values, True, format)
  File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 368, in _convert_listlike
    require_iso8601=require_iso8601
  File "pandas/_libs/tslib.pyx", line 492, in pandas._libs.tslib.array_to_datetime
  File "pandas/_libs/tslib.pyx", line 513, in pandas._libs.tslib.array_to_datetime
AssertionError
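
A possible reading of this AssertionError, offered as an assumption rather than something stated in the post: the second positional parameter of pd.to_datetime is errors, not format, so a format string passed positionally would be taken as the errors value. The sketch below checks the parameter order on whatever pandas version is installed and then passes the format by keyword. The two sample strings come from the file sample above.

```python
import inspect

import pandas as pd

# List the parameter names of pd.to_datetime in declaration order.
param_names = list(inspect.signature(pd.to_datetime).parameters)
second_param = param_names[1]
print(second_param)

# Passing the format by keyword avoids binding it to the wrong parameter.
# The values are strings here, so "042" keeps its leading zero.
dates = pd.to_datetime(pd.Series(["1980.042", "1980.125"]), format="%Y.%j")
print(dates)
```

If the second parameter is indeed errors, writing format='%Y.%j' explicitly is the safer habit in every later attempt as well.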

My second attempt was to convert the column to str and then to a date:

import pandas as pd
df = pd.read_csv('datas.csv')
print(df.dtypes, '\n\n', df.head())

df['data'] = df['data'].astype(str)
df['data'] = pd.to_datetime(df['data'], '%Y.%j')

Output:

data    float64
dtype: object 

        data
0  1980.042
1  1980.125
2  1980.208
3  1980.292
4  1980.375

Traceback (most recent call last):
  File "datas.py", line 6, in <module>
    df['data'] = pd.to_datetime(df['data'], '%Y.%j')
  File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 451, in to_datetime
    values = _convert_listlike(arg._values, True, format)
  File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 368, in _convert_listlike
    require_iso8601=require_iso8601
  File "pandas/_libs/tslib.pyx", line 492, in pandas._libs.tslib.array_to_datetime
  File "pandas/_libs/tslib.pyx", line 513, in pandas._libs.tslib.array_to_datetime
AssertionError

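Then I realized that, because of some internal floating-point issue, the data had ended up with more than three decimal places. So I rounded it to three decimal places before converting: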

import pandas as pd
df = pd.read_csv('datas.csv')
print(df.dtypes, '\n\n', df.head())
df['data'] = df['data'].round(3).astype(str)
print(df.dtypes, '\n\n', df.head())
df['data'] = pd.to_datetime(df['data'], '%Y.%j')

Output:

data    float64
dtype: object 

        data
0  1980.042
1  1980.125
2  1980.208
3  1980.292
4  1980.375

data    object
dtype: object 

        data
0  1980.042
1  1980.125
2  1980.208
3  1980.292
4  1980.375

Traceback (most recent call last):
  File "datas.py", line 8, in <module>
    df['data'] = pd.to_datetime(df['data'], '%Y.%j')
  File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 451, in to_datetime
    values = _convert_listlike(arg._values, True, format)
  File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 368, in _convert_listlike
    require_iso8601=require_iso8601
  File "pandas/_libs/tslib.pyx", line 492, in pandas._libs.tslib.array_to_datetime
  File "pandas/_libs/tslib.pyx", line 513, in pandas._libs.tslib.array_to_datetime
AssertionError

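Finally, I looked at the pandas documentation and some forums, and found that you can define the data type and apply a lambda function while reading the file: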

import pandas as pd

date_parser = lambda col: pd.to_datetime(str(col), format = '%Y.%j')

df = pd.read_csv('datas.csv', parse_dates = ['data'], date_parser = date_parser)

print(df.dtypes, '\n\n', df.head())

Output:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 377, in _convert_listlike
    values, tz = conversion.datetime_to_datetime64(arg)
  File "pandas/_libs/tslibs/conversion.pyx", line 188, in pandas._libs.tslibs.conversion.datetime_to_datetime64
TypeError: Unrecognized value type: <class 'str'>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "datas.py", line 5, in <module>
    df = pd.read_csv('datas.csv', parse_dates = ['data'], date_parser = date_parser)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 446, in _read
    data = parser.read(nrows)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1921, in read
    names, data = self._do_date_conversions(names, data)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1675, in _do_date_conversions
    self.index_names, names, keep_date_col=self.keep_date_col)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 3066, in _process_date_conversion
    data_dict[colspec] = converter(data_dict[colspec])
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 3033, in converter
    return generic_parser(date_parser, *date_cols)
  File "/usr/lib/python3/dist-packages/pandas/io/date_converters.py", line 39, in generic_parser
    results[i] = parse_func(*args)
  File "datas.py", line 3, in <lambda>
    date_parser = lambda col: pd.to_datetime(str(col), format = '%Y.%j')
  File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 469, in to_datetime
    result = _convert_listlike(np.array([arg]), box, format)[0]
  File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 380, in _convert_listlike
    raise e
  File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 347, in _convert_listlike
    errors=errors)
  File "pandas/_libs/tslibs/strptime.pyx", line 163, in pandas._libs.tslibs.strptime.array_strptime
ValueError: unconverted data remains: 5

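Either way, nothing worked. Has anyone run into this before? Any suggestions for reading the file with the correct data type, or for converting the column in the DataFrame?

2 Answers:

Answer 0: (score: 0)

You can try using the datetime module. Try the following code: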

import datetime
import pandas as pd

# Read the column as text so each value keeps its three decimal digits
df = pd.read_csv('datas.csv', dtype=str)
# Parse each value as "<year>.<day of year>"
df["data"] = df["data"].map(lambda x: datetime.datetime.strptime(x, '%Y.%j'))

However, this code will fail, because there is a problem with your data.

1980.375
1980.458
1980.542
1980.625
1980.708

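In these values the day number (the three digits after the decimal point) is greater than 365 (or 366 in a leap year such as 1980), which is not a valid day of the year, and that is why an error is raised.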
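
To make the point above concrete, here is a minimal sketch using only the standard library. parse_year_day is a hypothetical helper introduced for illustration (it is not part of pandas or datetime), and it assumes the values are read as text in the form '<year>.<day of year>'. It returns None for any day number that does not exist in that year instead of letting strptime raise.

```python
import calendar
from datetime import datetime


def parse_year_day(text):
    """Parse '<year>.<day of year>' text; return None when the day is out of range.

    Hypothetical helper for illustration only.
    """
    year_text, day_text = text.split(".")
    year, day = int(year_text), int(day_text)
    days_in_year = 366 if calendar.isleap(year) else 365
    if not 1 <= day <= days_in_year:
        return None
    return datetime.strptime(text, "%Y.%j")


values = ["1980.042", "1980.292", "1980.375", "1980.708"]
parsed = {value: parse_year_day(value) for value in values}
for value, result in parsed.items():
    print(value, "->", result)

# Without the range check, strptime itself rejects the out-of-range value.
try:
    datetime.strptime("1980.375", "%Y.%j")
    strptime_failed = False
except ValueError:
    strptime_failed = True
print("strptime rejected 1980.375:", strptime_failed)
```

The first two sample values parse normally, while '1980.375' and '1980.708' are reported as None.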

Hope this helps!

You can also try the code below, which is cleaner:

import pandas as pd 
import datetime 
date_parser = lambda x: datetime.datetime.strptime(x, '%Y.%j') 
df = pd.read_csv('datas.csv', parse_dates = ['data'], date_parser = date_parser) 
print(df)
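
As an alternative sketch, not part of the original answer: as far as I know, date_parser is deprecated in pandas 2.0 and later, so on a recent version it may be simpler to read the column as text and convert it afterwards. The example below assumes a small in-memory stand-in for datas.csv (csv_text) and a new column name, date, both chosen for illustration. errors="coerce" marks values that do not match the format as NaT instead of raising.

```python
import io

import pandas as pd

# Stand-in for 'datas.csv'; the last value has a day of year above 366.
csv_text = "data\n1980.042\n1980.125\n1980.375\n"

# dtype=str keeps the text exactly as written in the file.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# errors="coerce" turns values that do not match the format into NaT.
df["date"] = pd.to_datetime(df["data"], format="%Y.%j", errors="coerce")
print(df)
```

Rows that come back as NaT are the ones whose day number cannot be a day of the year, so they are easy to inspect or drop.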

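Answer 1: (score: 0)

I really hadn't noticed the problem with the data.

After removing the values whose decimal part is greater than 365, I tested Tuhin Sharma's idea.

Unfortunately, it returned the value of the first row for every row of the DataFrame.

However, when reading the file, I used the datetime module in the lambda function as Tuhin Sharma suggested, as shown below:

Sample file: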

data
1980.042
1980.125
1980.208
1980.292

Code:

import pandas as pd
import datetime
date_parser = lambda col: datetime.datetime.strptime(col, '%Y.%j')
df = pd.read_csv('datas.csv', parse_dates = ['data'], date_parser = date_parser)
print(df)

Output:

        data
0 1980-02-11
1 1980-05-04
2 1980-07-26
3 1980-10-18
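
The question asked for the dates in 'DD/MM/YYYY' form. A minimal sketch of that last step, assuming the same four sample values as text and using only the standard library:

```python
from datetime import datetime

samples = ["1980.042", "1980.125", "1980.208", "1980.292"]

# Parse "<year>.<day of year>" and render each date as DD/MM/YYYY.
formatted = [datetime.strptime(s, "%Y.%j").strftime("%d/%m/%Y") for s in samples]
print(formatted)
# ['11/02/1980', '04/05/1980', '26/07/1980', '18/10/1980']
```

On a DataFrame column that already holds datetimes, df['data'].dt.strftime('%d/%m/%Y') should give the same text.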