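How to handle irregular formats from a CSV using pandas

Time: 2018-05-08 04:37:05

Tags: python pandas

I have been trying different ways to handle datetimes from a CSV in pandas.

I have 3 columns in the csv file: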

  1. kickoffDate
  2. kickoffTime
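  3. time

The first two columns are formatted correctly, but the third column, time, has mixed formats: some values are a time and others are a full datetime.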

For example:

    12:00:00 AM
    1/1/1900 9:04:00 PM

How can I convert these to a single format using pandas?

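Second, I want to add all three columns together to get the event time.

For example:

    kickoffDate = ['8/6/2017', '8/6/2017', '8/6/2017']
    kickoffTime = ['15:00:00', '15:00:00', '15:00:00']
    time = ['51:48:00', '86:05:00', '10:04']

The time here is in mm:ss:00 format (some values, such as 10:04, omit the trailing :00). I want to combine these three columns into a new column named eventdatetime:

    eventdatetime = [06-08-2017 15:51:48, 06-08-2017 16:26:05, 06-08-2017 15:10:04]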

How can I do this? I can combine the first two columns using:

    DateTime1 = data['kickoffDate'] + ' ' + data['kickoffTime']

The original csv file can be downloaded from the following link:

https://drive.google.com/open?id=1JL65x7nq2m6zk4qnaRUDKL894aEdXW_B

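1 answer:

Answer 0 (score: 1)

You can pass the first and second columns together to the parse_dates parameter of read_csv so they are parsed as a single datetime column. Then convert the last column with to_timedelta: strip the trailing :00 from values that have it, and prepend 00: so the remaining mm:ss is read with zero hours: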

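import pandas as pd

# Parse columns 0 and 1 together as one datetime (nested parse_dates is deprecated in recent pandas)
df = pd.read_csv('Datetimetest.csv', parse_dates=[[0, 1]])

# True where the value has two colons, i.e. the mm:ss:00 form
m = df['time'].str.count(':') != 1
# Strip the trailing ':00' on those rows (regex=True is needed on pandas 2.0+), then prepend '00:' as hours
df['time'] = pd.to_timedelta('00:' + df['time'].mask(m, df['time'].str.replace(':00$', '', regex=True)))
df['eventdatetime'] = df['kickoffDate_kickoffTime'] + df['time']
print(df.head())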
  kickoffDate_kickoffTime     time       eventdatetime
0     2018-04-30 19:00:00 00:47:36 2018-04-30 19:47:36
1     2018-04-30 19:00:00 00:15:28 2018-04-30 19:15:28
2     2018-04-29 13:15:00 00:52:03 2018-04-29 14:07:03
3     2018-04-29 13:15:00 01:03:42 2018-04-29 14:18:42
4     2018-04-29 13:15:00 00:10:43 2018-04-29 13:25:43
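
To check the approach without the CSV file, here is a minimal sketch that applies the same steps to the three sample rows from the question. The in-memory DataFrame, the variable names kickoff, has_suffix, mm_ss and elapsed, and the month/day/year reading of '8/6/2017' are assumptions made for illustration. It combines the date and time strings with to_datetime rather than with nested parse_dates.

```python
import pandas as pd

# Sample values copied from the question; the DataFrame itself is illustrative.
df = pd.DataFrame({
    "kickoffDate": ["8/6/2017", "8/6/2017", "8/6/2017"],
    "kickoffTime": ["15:00:00", "15:00:00", "15:00:00"],
    "time": ["51:48:00", "86:05:00", "10:04"],
})

# Assumption: dates are month/day/year. Adjust the format if they are day/month/year.
kickoff = pd.to_datetime(
    df["kickoffDate"] + " " + df["kickoffTime"], format="%m/%d/%Y %H:%M:%S"
)

# Values with two colons are "mm:ss:00"; strip the trailing ":00" only on those rows
# so that a genuine "mm:00" value is left untouched.
has_suffix = df["time"].str.count(":") == 2
mm_ss = df["time"].mask(has_suffix, df["time"].str.replace(r":00$", "", regex=True))

# Prepend "00:" so the string is read as hours:minutes:seconds with zero hours.
elapsed = pd.to_timedelta("00:" + mm_ss)

df["eventdatetime"] = kickoff + elapsed
print(df["eventdatetime"].dt.strftime("%H:%M:%S").tolist())
```

The time-of-day part of the result matches the values the question asks for (15:51:48, 16:26:05, 15:10:04).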

Another solution, if you need 4 separate columns in the output: convert only the first column to datetime and the other two to timedelta:

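import pandas as pd

# Parse only the first column (kickoffDate) as datetime
df = pd.read_csv('Datetimetest.csv', parse_dates=[0])

# Rows with two colons are in mm:ss:00 form; strip the trailing ':00' there
m = df['time'].str.count(':') != 1
df['time'] = pd.to_timedelta('00:' + df['time'].mask(m, df['time'].str.replace(':00$', '', regex=True)))
# kickoffTime becomes an offset from midnight
df['kickoffTime'] = pd.to_timedelta(df['kickoffTime'])
df['eventdatetime'] = df['kickoffDate'] + df['kickoffTime'] + df['time']
print(df.head())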
  kickoffDate kickoffTime     time       eventdatetime
0  2018-04-30    19:00:00 00:47:36 2018-04-30 19:47:36
1  2018-04-30    19:00:00 00:15:28 2018-04-30 19:15:28
2  2018-04-29    13:15:00 00:52:03 2018-04-29 14:07:03
3  2018-04-29    13:15:00 01:03:42 2018-04-29 14:18:42
4  2018-04-29    13:15:00 00:10:43 2018-04-29 13:25:43
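
If the string manipulation above feels opaque, the same timedelta can be built by explicit arithmetic. The sketch below is an alternative, not the answer's method: match_clock_to_timedelta is a hypothetical helper name, and it assumes every value starts with minutes and seconds separated by a colon, ignoring any third field.

```python
import pandas as pd

def match_clock_to_timedelta(values: pd.Series) -> pd.Series:
    """Hypothetical helper: read 'mm:ss' or 'mm:ss:00' strings as elapsed time."""
    # Split on ':'; column 0 holds minutes, column 1 holds seconds.
    # A third column (the trailing '00'), when present, is ignored.
    parts = values.str.split(":", expand=True)
    minutes = pd.to_numeric(parts[0])
    seconds = pd.to_numeric(parts[1])
    return pd.to_timedelta(minutes, unit="min") + pd.to_timedelta(seconds, unit="s")

# Sample values from the question.
sample = pd.Series(["51:48:00", "86:05:00", "10:04"])
elapsed = match_clock_to_timedelta(sample)
print(elapsed.dt.total_seconds().tolist())
```

Minutes above 59 need no special handling here because each field is converted with its own unit before the two are added.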

EDIT:

If the input data is not a csv, you can use to_datetime instead of the parse_dates parameter of read_csv to convert the first column to datetime:

import pandas as pd

df = pd.read_csv('Datetimetest.csv')

# Rows with two colons are in mm:ss:00 form; strip the trailing ':00' there
m = df['time'].str.count(':') != 1
df['time'] = pd.to_timedelta('00:' + df['time'].mask(m, df['time'].str.replace(':00$', '', regex=True)))

# Convert after loading instead of relying on parse_dates
df['kickoffDate'] = pd.to_datetime(df['kickoffDate'])
df['kickoffTime'] = pd.to_timedelta(df['kickoffTime'])
df['eventdatetime'] = df['kickoffDate'] + df['kickoffTime'] + df['time']
print(df.head())
  kickoffDate kickoffTime     time       eventdatetime
0  2018-04-30    19:00:00 00:47:36 2018-04-30 19:47:36
1  2018-04-30    19:00:00 00:15:28 2018-04-30 19:15:28
2  2018-04-29    13:15:00 00:52:03 2018-04-29 14:07:03
3  2018-04-29    13:15:00 01:03:42 2018-04-29 14:18:42
4  2018-04-29    13:15:00 00:10:43 2018-04-29 13:25:43
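
The first part of the question, the column that mixes values like 12:00:00 AM with values like 1/1/1900 9:04:00 PM, is not covered by the code above. One possible sketch follows. It assumes the leading date in such values carries no information and can be discarded, which the question does not confirm. The names raw, clock, parsed and uniform are illustrative.

```python
import pandas as pd

# The two example values from the question.
raw = pd.Series(["12:00:00 AM", "1/1/1900 9:04:00 PM"])

# Keep only the trailing "H:MM:SS AM/PM" part, dropping any leading date.
clock = raw.str.extract(r"(\d{1,2}:\d{2}:\d{2}\s+[AP]M)\s*$", expand=False)

# Parse the 12-hour clock text, then render every value as 24-hour HH:MM:SS.
parsed = pd.to_datetime(clock, format="%I:%M:%S %p")
uniform = parsed.dt.strftime("%H:%M:%S")
print(uniform.tolist())
```

If the date part does matter in your data, parse the full string instead and decide separately how to treat rows that have no date.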