我有一个看起来像这样的CSV文件:
time, Numbers
[30/Apr/1998:21:30:17,24736
[30/Apr/1998:21:30:53,24736
[30/Apr/1998:21:31:12,24736
[30/Apr/1998:21:31:19,3781
[30/Apr/1998:21:31:22,-
[30/Apr/1998:21:31:27,24736
[30/Apr/1998:21:31:29,-
[30/Apr/1998:21:31:29,-
[30/Apr/1998:21:31:32,929
[30/Apr/1998:21:31:43,-
[30/Apr/1998:21:31:44,1139
[30/Apr/1998:21:31:52,24736
[30/Apr/1998:21:31:52,3029
[30/Apr/1998:21:32:06,24736
[30/Apr/1998:21:32:16,-
[30/Apr/1998:21:32:16,-
[30/Apr/1998:21:32:17,-
[30/Apr/1998:21:32:30,14521
[30/Apr/1998:21:32:33,11324
[30/Apr/1998:21:32:35,24736
[30/Apr/1998:21:32:3l8,671
[30/Apr/1998:21:32:38,1512
[30/Apr/1998:21:32:38,1136
[30/Apr/1998:21:32:38,1647
[30/Apr/1998:21:32:38,1271
[30/Apr/1998:21:32:52,5933
[30/Apr/1998:21:32:58,-
[30/Apr/1998:21:32:59,231
upto one billion,
忘记数字列,我担心将CSV文件中的此日期格式转换为熊猫时间戳,因此我可以绘制数据集并根据时间对其进行可视化,因为我是数据科学领域的新手,这是我的方法:
step 1: take all the time colum from my CSV file into an array,
step 2: split the data from the mid where :(colon) occurs, make two new arrays of date and time,
step 3: remove "[" from date array,
step 4: replace all forward slash into dashes in the date array,
step 5: and then append date and time array to make a single pandas format,
看起来像这样,2017-03-22 15:16:45
就像您所知道的,我是新手,我的做法既幼稚又错误,如果有人可以帮助我提供代码段,我将非常高兴,谢谢< / p>
答案 0 :(得分:2)
您可以将格式传递给window.onload
,在这种情况下:pd.to_datetime()
。
请注意错误数据,但请注意以下示例数据的第3行([30 / Apr / 1998:21:32:3l8,671)。为了不出错,您可以传递[%d/%b/%Y:%H:%M:%S
,并返回Not Time(NaT)。
另一种方法是手动替换这些行,或者先编写某种正则表达式/替换功能。
errors=coerce
返回:
import pandas as pd
data = '''\
time, Numbers
[30/Apr/1998:21:30:17,24736
[30/Apr/1998:21:30:53,24736
[30/Apr/1998:21:32:3l8,671
[30/Apr/1998:21:32:38,1512
[30/Apr/1998:21:32:38,1136
[30/Apr/1998:21:32:58,-
[30/Apr/1998:21:32:59,231'''
fileobj = pd.compat.StringIO(data)
df = pd.read_csv(fileobj, sep=',', na_values=['-'])
df['time'] = pd.to_datetime(df['time'], format='[%d/%b/%Y:%H:%M:%S', errors='coerce')
print(df)
请注意:此处使用 time Numbers
0 1998-04-30 21:30:17 24736.0
1 1998-04-30 21:30:53 24736.0
2 NaT 671.0
3 1998-04-30 21:32:38 1512.0
4 1998-04-30 21:32:38 1136.0
5 1998-04-30 21:32:58 NaN
6 1998-04-30 21:32:59 231.0
来帮助熊猫了解Numbers列实际上是数字而不是字符串。
现在我们可以执行分组操作(例如,每分钟):
na_values=['-']