Question

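I have imported a CSV file with 15 columns and more than 100,000 rows into a data frame. One of the columns is 'birth', which represents the year of birth. The 'birth' column actually contains three different string formats: dates written like '02-Aug-34', dates written like '29DEC1899', and finally the empty string ''.

I wrote a script that sorts the 'birth' strings by type and then converts the non-blank strings to datetime using the matching format. I loop over the appropriate lists of row numbers and replace each 'birth' entry in the data frame with its datetime, basically overwriting the previous value.

Going through the 100,000+ entries takes about 130 seconds. Given the three different kinds of input value, is there a more efficient way to convert the data type? Is the completion time (130 seconds) reasonable?

I am fairly proficient with pandas.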

Answer 1

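You can call to_datetime twice, once per format, and then use combine_first:

Also note that 02-Aug-15 is ambiguous: it could mean 02-Aug-1815, 02-Aug-1915 or 02-Aug-2015, and there is no way to tell these apart from the string alone.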

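import pandas as pd

df = pd.DataFrame({'date':['02-Aug-34','29DEC1899','02-Aug-15','']})

# format 29DEC1899; anything that does not match becomes NaT
d1 = pd.to_datetime(df['date'], format='%d%b%Y', errors='coerce')

# insert the century: replace the last '-' with '-19'
# (regex=True is needed since pandas 2.0, where str.replace matches literally by default)
dates = df['date'].str.replace(r'(.*)-', r'\1-19', regex=True)
# alternative 1
#dates = df['date'].str[::-1].str.replace('-', '91-', n=1).str[::-1]
# alternative 2
#dates = df['date'].str.rsplit('-', n=1).str.join('-19')

# format 02-Aug-34, which is now 02-Aug-1934
d2 = pd.to_datetime(dates, format='%d-%b-%Y', errors='coerce')

# combine formats: keep d1 where it parsed, otherwise fall back to d2
d = d1.combine_first(d2)
print(d)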
0   1934-08-02
1   1899-12-29
2   1915-08-02
3          NaT
Name: date, dtype: datetime64[ns]
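
If hard-coding the 19 prefix is too blunt, one possible alternative is to parse the two-digit year with %y and then move any date that lands in the future back by a century. This is only a sketch: the sample values and the cutoff date are assumptions made for illustration, and the cutoff has to be chosen from what is known about the data.

```python
import pandas as pd

# Hypothetical sample values in the two-digit-year format from the question.
raw = pd.Series(['02-Aug-34', '02-Aug-15', ''])

# %y uses the POSIX pivot: 00-68 become 2000-2068, 69-99 become 1969-1999.
parsed = pd.to_datetime(raw, format='%d-%b-%y', errors='coerce')

# Assumed cutoff: no birth date in the data can be later than this.
cutoff = pd.Timestamp('2020-01-01')

# Shift every date after the cutoff back by 100 years; NaT stays NaT.
fixed = parsed.mask(parsed > cutoff, parsed - pd.DateOffset(years=100))
print(fixed)
```

With this cutoff, '02-Aug-34' ends up in 1934 while '02-Aug-15' stays in 2015, unlike the version above, which forces every two-digit year into the 1900s. Which reading is right depends on the data.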

Answer 2

Use to_datetime:

http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.to_datetime.html

# Can be the same 'Date' column or different
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

You can also specify the date format explicitly, for example format='%d-%m-%Y'.
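
As a rough illustration of the format argument, an explicit format for all-numeric day-month-year strings might look like the following. The sample strings are made up for this sketch and are not taken from the question's data.

```python
import pandas as pd

# Hypothetical all-numeric dates in day-month-year order.
sample = pd.Series(['02-08-1934', '29-12-1899'])

# With an explicit format, pandas does not have to infer the layout,
# and values that do not match raise an error unless errors='coerce' is passed.
explicit = pd.to_datetime(sample, format='%d-%m-%Y')
print(explicit)
```

Note that %m expects a numeric month. The strings in the question use month abbreviations such as Aug and DEC, which call for %b instead.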

Change the data type of a set of entries in a data frame
