Converting a raw date format to pandas datetime objects

Time: 2018-07-21 06:33:39

Tags: python pandas numpy

I have a CSV file that looks like this:

time, Numbers
[30/Apr/1998:21:30:17,24736
[30/Apr/1998:21:30:53,24736
[30/Apr/1998:21:31:12,24736
[30/Apr/1998:21:31:19,3781
[30/Apr/1998:21:31:22,-
[30/Apr/1998:21:31:27,24736
[30/Apr/1998:21:31:29,-
[30/Apr/1998:21:31:29,-
[30/Apr/1998:21:31:32,929
[30/Apr/1998:21:31:43,-
[30/Apr/1998:21:31:44,1139
[30/Apr/1998:21:31:52,24736
[30/Apr/1998:21:31:52,3029
[30/Apr/1998:21:32:06,24736
[30/Apr/1998:21:32:16,-
[30/Apr/1998:21:32:16,-
[30/Apr/1998:21:32:17,-
[30/Apr/1998:21:32:30,14521
[30/Apr/1998:21:32:33,11324
[30/Apr/1998:21:32:35,24736
[30/Apr/1998:21:32:3l8,671
[30/Apr/1998:21:32:38,1512
[30/Apr/1998:21:32:38,1136
[30/Apr/1998:21:32:38,1647
[30/Apr/1998:21:32:38,1271
[30/Apr/1998:21:32:52,5933
[30/Apr/1998:21:32:58,-
[30/Apr/1998:21:32:59,231
up to one billion,

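Forget about the Numbers column. My concern is converting this date format from the CSV file into pandas timestamps, so that I can plot the dataset and visualize it over time. I am new to data science, and this is my approach: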

step 1: take the entire time column from my CSV file into an array,
step 2: split the data in the middle, where the colon (:) occurs, to make two new arrays of date and time,
step 3: remove "[" from the date array,
step 4: replace all forward slashes with dashes in the date array,
step 5: and then join the date and time arrays into a single pandas-compatible format,

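which would look like this: 2017-03-22 15:16:45. As you can tell, I am a novice and my approach is naive and probably wrong. If someone could help me with a code snippet I would be very grateful. Thanks.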

1 Answer:

Answer 0: (score: 2)

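You can pass a format to pd.to_datetime(), in this case: [%d/%b/%Y:%H:%M:%S. Watch out for bad data, though; see for instance row 3 of the sample data below ([30/Apr/1998:21:32:3l8,671). To avoid raising an error you can pass errors='coerce', which returns Not a Time (NaT) for any value that cannot be parsed.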

The other way is to fix those rows manually, or to write some kind of regex/replace function first.
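As a rough sketch of the regex/replace route, the snippet below flags values that do not match the expected layout and then strips stray characters from the time-of-day part. The `PATTERN` constant and the `repair` helper are hypothetical names introduced for illustration, and the repair rule (drop anything that is not a digit or a colon after the first colon) is an assumption about what the bad rows look like, not something the data guarantees.

```python
import re

import pandas as pd

# A few raw values in the same layout as the CSV, including the malformed one.
raw = pd.Series([
    "[30/Apr/1998:21:32:35",
    "[30/Apr/1998:21:32:3l8",
    "[30/Apr/1998:21:32:38",
])

# Hypothetical pattern for a well-formed value: [DD/Mon/YYYY:HH:MM:SS
PATTERN = r"^\[\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2}$"
is_valid = raw.str.match(PATTERN)

# Show the offending values so they can be inspected before any repair.
print(raw[~is_valid])


def repair(value: str) -> str:
    """Assumed repair rule: keep only digits and colons after the first colon."""
    date_part, sep, time_part = value.partition(":")
    return date_part + sep + re.sub(r"[^0-9:]", "", time_part)


# Keep valid values untouched and repair only the invalid ones.
cleaned = raw.where(is_valid, raw.map(repair))
parsed = pd.to_datetime(cleaned, format="[%d/%b/%Y:%H:%M:%S", errors="coerce")
print(parsed)
```

Whether silently repairing a value is acceptable depends on the data; inspecting the flagged rows first, or leaving them as NaT, may be the safer choice.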

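import pandas as pd
from io import StringIO

data = '''\
time, Numbers
[30/Apr/1998:21:30:17,24736
[30/Apr/1998:21:30:53,24736
[30/Apr/1998:21:32:3l8,671
[30/Apr/1998:21:32:38,1512
[30/Apr/1998:21:32:38,1136
[30/Apr/1998:21:32:58,-
[30/Apr/1998:21:32:59,231'''

# pd.compat.StringIO is no longer available in newer pandas releases; io.StringIO does the same job
fileobj = StringIO(data)
# na_values=['-'] turns the "-" placeholders into NaN;
# skipinitialspace=True strips the space after the comma in the header
df = pd.read_csv(fileobj, sep=',', na_values=['-'], skipinitialspace=True)

# Parse with an explicit format; values that do not match become NaT instead of raising
df['time'] = pd.to_datetime(df['time'], format='[%d/%b/%Y:%H:%M:%S', errors='coerce')
print(df)

Returns:

                 time  Numbers
0 1998-04-30 21:30:17  24736.0
1 1998-04-30 21:30:53  24736.0
2                 NaT    671.0
3 1998-04-30 21:32:38   1512.0
4 1998-04-30 21:32:38   1136.0
5 1998-04-30 21:32:58      NaN
6 1998-04-30 21:32:59    231.0

Note: na_values=['-'] is used here to help pandas understand that the Numbers column holds numbers rather than strings.

From here we can perform groupby operations (for instance, per minute).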
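A per-minute grouping could look roughly like the sketch below. It is not the answer author's code: the choice of aggregations (count and sum), the `valid` and `per_minute` names, and dropping NaT rows before grouping are all assumptions made for illustration.

```python
from io import StringIO

import pandas as pd

data = """time,Numbers
[30/Apr/1998:21:30:17,24736
[30/Apr/1998:21:30:53,24736
[30/Apr/1998:21:32:3l8,671
[30/Apr/1998:21:32:38,1512
[30/Apr/1998:21:32:38,1136
[30/Apr/1998:21:32:58,-
[30/Apr/1998:21:32:59,231"""

df = pd.read_csv(StringIO(data), na_values=["-"])
df["time"] = pd.to_datetime(
    df["time"], format="[%d/%b/%Y:%H:%M:%S", errors="coerce"
)

# Rows whose timestamp could not be parsed (NaT) cannot be assigned to a minute.
valid = df.dropna(subset=["time"])

# Floor each timestamp to the start of its minute and aggregate per bucket.
# "count" ignores NaN values in Numbers; "sum" skips them.
per_minute = (
    valid.groupby(valid["time"].dt.floor("min"))["Numbers"]
    .agg(["count", "sum"])
)
print(per_minute)
```

Setting the time column as the index and calling `resample` is another common option; unlike the groupby above, it also produces rows for minutes that contain no data.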