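I need to use pandas to import a CSV file whose date field is in "year.decimal day" format, i.e. a year followed by a three-digit day of the year (e.g. '1980.042'), and convert it to 'DD/MM/YYYY' format, e.g. '11/02/1980'.
Sample file: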
data
1980.042
1980.125
1980.208
1980.292
1980.375
1980.458
1980.542
1980.625
1980.708
Using pd.to_datetime, I can convert a single value like this:
d = '1980.042'
print(pd.to_datetime(d, format = '%Y.%j'))
Output:
1980-02-11 00:00:00
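Since the stated goal is a 'DD/MM/YYYY' string, here is a minimal sketch of the full round trip for a single value using only the standard library. It assumes the part after the dot really is a three-digit day of the year; the variable names are illustrative.

```python
from datetime import datetime

raw = "1980.042"  # year 1980, day 42 of the year

# %Y consumes the four-digit year, %j the day of the year (001-366)
parsed = datetime.strptime(raw, "%Y.%j")

# Render in the DD/MM/YYYY layout the question asks for
formatted = parsed.strftime("%d/%m/%Y")
print(formatted)  # 11/02/1980
```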
My first attempt was to read the file and convert the DataFrame column:
import pandas as pd
df = pd.read_csv('datas.csv')
print(df.dtypes, '\n\n', df.head())
df['data'] = pd.to_datetime(df['data'], '%Y.%j')
Output:
data float64
dtype: object
data
0 1980.042
1 1980.125
2 1980.208
3 1980.292
4 1980.375
Traceback (most recent call last):
File "datas.py", line 4, in <module>
df['data'] = pd.to_datetime(df['data'], '%Y.%j')
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 451, in to_datetime
values = _convert_listlike(arg._values, True, format)
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 368, in _convert_listlike
require_iso8601=require_iso8601
File "pandas/_libs/tslib.pyx", line 492, in pandas._libs.tslib.array_to_datetime
File "pandas/_libs/tslib.pyx", line 513, in pandas._libs.tslib.array_to_datetime
AssertionError
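Two things probably combine here, though I am inferring this from the output rather than from the pandas source: the column was read as float64 rather than text, and the format string is passed positionally, where the second parameter of pd.to_datetime is errors, not format. A sketch that avoids both, using an in-memory copy of the first four sample rows as a stand-in for datas.csv:

```python
import io
import pandas as pd

# In-memory stand-in for datas.csv (first four sample rows)
csv_text = "data\n1980.042\n1980.125\n1980.208\n1980.292\n"

# dtype=str keeps '1980.042' as text instead of converting it to a float
df = pd.read_csv(io.StringIO(csv_text), dtype={"data": str})

# format must be passed by keyword; the second positional argument is errors
df["data"] = pd.to_datetime(df["data"], format="%Y.%j")

print(df.dtypes)
print(df)
```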
My second attempt was to cast the column to str and then convert it to a date:
import pandas as pd
df = pd.read_csv('datas.csv')
print(df.dtypes, '\n\n', df.head())
df['data'] = df['data'].astype(str)
df['data'] = pd.to_datetime(df['data'], '%Y.%j')
Output:
data float64
dtype: object
data
0 1980.042
1 1980.125
2 1980.208
3 1980.292
4 1980.375
Traceback (most recent call last):
File "datas.py", line 6, in <module>
df['data'] = pd.to_datetime(df['data'], '%Y.%j')
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 451, in to_datetime
values = _convert_listlike(arg._values, True, format)
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 368, in _convert_listlike
require_iso8601=require_iso8601
File "pandas/_libs/tslib.pyx", line 492, in pandas._libs.tslib.array_to_datetime
File "pandas/_libs/tslib.pyx", line 513, in pandas._libs.tslib.array_to_datetime
AssertionError
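Then I realized that, because of floating-point representation, the values ended up with more than three decimal places, so I rounded them to three decimal places before converting: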
import pandas as pd
df = pd.read_csv('datas.csv')
print(df.dtypes, '\n\n', df.head())
df['data'] = df['data'].round(3).astype(str)
print(df.dtypes, '\n\n', df.head())
df['data'] = pd.to_datetime(df['data'], '%Y.%j')
Output:
data float64
dtype: object
data
0 1980.042
1 1980.125
2 1980.208
3 1980.292
4 1980.375
data object
dtype: object
data
0 1980.042
1 1980.125
2 1980.208
3 1980.292
4 1980.375
Traceback (most recent call last):
File "datas.py", line 8, in <module>
df['data'] = pd.to_datetime(df['data'], '%Y.%j')
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 451, in to_datetime
values = _convert_listlike(arg._values, True, format)
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 368, in _convert_listlike
require_iso8601=require_iso8601
File "pandas/_libs/tslib.pyx", line 492, in pandas._libs.tslib.array_to_datetime
File "pandas/_libs/tslib.pyx", line 513, in pandas._libs.tslib.array_to_datetime
AssertionError
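Rounding helps with stray digits, but going through float has a second pitfall worth checking: trailing zeros are dropped, so a value such as 1980.100 (a hypothetical row, not in the sample file) becomes the string '1980.1', which %j would read as day 1. A sketch, assuming a three-digit day field, that formats the float back to exactly three decimals before parsing:

```python
from datetime import datetime

value = 1980.100  # hypothetical value: day 100 of 1980, as read into a float

naive_text = str(round(value, 3))  # '1980.1'   -> trailing zeros are gone
padded_text = f"{value:.3f}"       # '1980.100' -> three digits restored

naive = datetime.strptime(naive_text, "%Y.%j")
padded = datetime.strptime(padded_text, "%Y.%j")

print(naive.date())   # 1980-01-01 (day 1, not what was meant)
print(padded.date())  # 1980-04-09 (day 100, as intended)
```

Reading the column as text in the first place (dtype=str) sidesteps the issue altogether.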
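Finally, I went through the pandas documentation and some forums and found that you can have the column parsed while the file is being read by passing a lambda function as the date parser: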
import pandas as pd
date_parser = lambda col: pd.to_datetime(str(col), format = '%Y.%j')
df = pd.read_csv('datas.csv', parse_dates = ['data'], date_parser = date_parser)
print(df.dtypes, '\n\n', df.head())
Output:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 377, in _convert_listlike
values, tz = conversion.datetime_to_datetime64(arg)
File "pandas/_libs/tslibs/conversion.pyx", line 188, in pandas._libs.tslibs.conversion.datetime_to_datetime64
TypeError: Unrecognized value type: <class 'str'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "datas.py", line 5, in <module>
df = pd.read_csv('datas.csv', parse_dates = ['data'], date_parser = date_parser)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 678, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 446, in _read
data = parser.read(nrows)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1036, in read
ret = self._engine.read(nrows)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1921, in read
names, data = self._do_date_conversions(names, data)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1675, in _do_date_conversions
self.index_names, names, keep_date_col=self.keep_date_col)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 3066, in _process_date_conversion
data_dict[colspec] = converter(data_dict[colspec])
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 3033, in converter
return generic_parser(date_parser, *date_cols)
File "/usr/lib/python3/dist-packages/pandas/io/date_converters.py", line 39, in generic_parser
results[i] = parse_func(*args)
File "datas.py", line 3, in <lambda>
date_parser = lambda col: pd.to_datetime(str(col), format = '%Y.%j')
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 469, in to_datetime
result = _convert_listlike(np.array([arg]), box, format)[0]
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 380, in _convert_listlike
raise e
File "/usr/lib/python3/dist-packages/pandas/core/tools/datetimes.py", line 347, in _convert_listlike
errors=errors)
File "pandas/_libs/tslibs/strptime.pyx", line 163, in pandas._libs.tslibs.strptime.array_strptime
ValueError: unconverted data remains: 5
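In short, nothing has worked. Has anyone run into this before? Any suggestions for reading the file with the right data type, or for converting the column in the DataFrame?
Answer 0 (score: 0)
You can try the datetime module, for example with the following code: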
import datetime
import pandas as pd
# Read every column as text so '1980.042' is not turned into a float
df = pd.read_csv('datas.csv', dtype=str)
# Parse "year.day-of-year" strings such as '1980.042'
df["data"] = df["data"].map(lambda x: datetime.datetime.strptime(x, '%Y.%j'))
However, this code will fail on your file, because there is a problem with your data:
1980.375
1980.458
1980.542
1980.625
1980.708
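In these values the day-of-year part (the three digits after the decimal point) is greater than 365, or 366 in a leap year such as 1980. No such day exists, which is why an error is raised.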
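To see which rows are affected before parsing, the day field can be checked against the length of its year. A minimal sketch using only the standard library; find_invalid_days is a hypothetical helper name, and it assumes every value has the form YYYY.DDD:

```python
import calendar

def find_invalid_days(values):
    """Return the values whose day-of-year part does not exist in that year."""
    invalid = []
    for text in values:
        year_text, _, day_text = text.partition(".")
        year, day = int(year_text), int(day_text)
        days_in_year = 366 if calendar.isleap(year) else 365
        if not 1 <= day <= days_in_year:
            invalid.append(text)
    return invalid

sample = ["1980.042", "1980.292", "1980.375", "1980.458", "1980.708"]
print(find_invalid_days(sample))  # ['1980.375', '1980.458', '1980.708']
```

This would also explain the "unconverted data remains: 5" message in the question: for '1980.375', %j can match at most '37', which leaves the '5' unconsumed.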
Hope this helps!
You can also try the code below, which is cleaner:
import pandas as pd
import datetime
date_parser = lambda x: datetime.datetime.strptime(x, '%Y.%j')
df = pd.read_csv('datas.csv', parse_dates = ['data'], date_parser = date_parser)
print(df)
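If the invalid rows should be flagged rather than crash the import, another option is to read the column as text and let pd.to_datetime mark unparseable values as NaT with errors='coerce'. A sketch, using an in-memory stand-in for datas.csv; the new column name date is my own choice:

```python
import io
import pandas as pd

# In-memory stand-in for datas.csv: two valid rows and one invalid row
csv_text = "data\n1980.042\n1980.125\n1980.375\n"

df = pd.read_csv(io.StringIO(csv_text), dtype={"data": str})

# Invalid day-of-year values (e.g. 375) become NaT instead of raising
df["date"] = pd.to_datetime(df["data"], format="%Y.%j", errors="coerce")

# Rows that failed to parse, kept for inspection
bad_rows = df[df["date"].isna()]
print(bad_rows["data"].tolist())  # ['1980.375']
```

Keeping the original text column next to the parsed one makes it easy to review what was rejected.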
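Answer 1 (score: 0)
I really hadn't noticed the problem with the data.
After removing the rows whose decimal part is greater than 365, I tested Tuhin Sharma's idea.
Unfortunately, it returned the value of the first row for every row of the DataFrame.
However, using the datetime module in a lambda function passed to read_csv when reading the file, as Tuhin Sharma suggested, did work, as follows:
Sample file: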
data
1980.042
1980.125
1980.208
1980.292
Code:
import pandas as pd
import datetime
date_parser = lambda col: datetime.datetime.strptime(col, '%Y.%j')
df = pd.read_csv('datas.csv', parse_dates = ['data'], date_parser = date_parser)
print(df)
Output:
data
0 1980-02-11
1 1980-05-04
2 1980-07-26
3 1980-10-18
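One caveat about dropping those rows: the sample values are spaced roughly 1/12 apart (0.042, 0.125, 0.208, ...), which is the pattern mid-month decimal years would produce. That is only a reading of the numbers, not something the question states. If it were the case, the fraction would be a share of the year rather than a day number, and a conversion could be sketched like this (decimal_year_to_date is a hypothetical helper):

```python
from datetime import datetime

def decimal_year_to_date(value: str) -> datetime:
    """Interpret 'YYYY.fff' as a year plus a fraction of that year (assumption)."""
    year_text, _, frac_text = value.partition(".")
    year = int(year_text)
    fraction = float("0." + frac_text) if frac_text else 0.0
    start = datetime(year, 1, 1)
    end = datetime(year + 1, 1, 1)
    # Scale the length of this particular year (365 or 366 days) by the fraction
    return start + (end - start) * fraction

for text in ["1980.042", "1980.375", "1980.708"]:
    print(text, decimal_year_to_date(text).strftime("%d/%m/%Y"))
```

Under this interpretation every row in the original file converts, so it may be worth confirming with the data provider which meaning is intended.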