Importing a CSV into NumPy datetime64

Time: 2011-09-29 15:26:53

Tags: python numpy

I am trying out the latest development version of NumPy, 2.0 dev:

np.__version__
Out[44]: '2.0.0.dev-aded70c'

I am trying to import CSV data that looks like this:

date,system,pumping,rgt,agt,sps,eskom_import,temperature,wind,pressure,weather
2007-01-01 00:30,481.9,,,,,481.9,15,SW,1040,Fine
2007-01-01 01:00,471.9,,,,,471.9,15,SW,1040,Fine
2007-01-01 01:30,455.9,,,,,455.9,,,,

using the following code:

convertdict = {0: lambda s: np.datetime64(s, 'm'),
               1: lambda s: float(s or 0),
               2: lambda s: float(s or 0),
               3: lambda s: float(s or 0),
               4: lambda s: float(s or 0),
               5: lambda s: float(s or 0),
               6: lambda s: float(s or 0),
               7: lambda s: float(s or 0),
               8: str,
               9: str,
               10: str}

dt = [('date', np.datetime64), ('system', float), ('pumping', float),
      ('rgt', float), ('agt', float), ('sps', float),
      ('eskom_import', float), ('temperature', float), ('wind', str),
      ('pressure', float), ('weather', str)]

a = np.recfromcsv(fp, dtype=dt, converters=convertdict, usecols=range(0, 11),
                  names=True)

The dtype it produces for a.date is 'object':

array([2007-01-01T00:30+0200, 2007-01-01T01:00+0200, 2007-01-01T01:30+0200,
       ..., 2007-12-31T23:00+0200, 2007-12-31T23:30+0200,
       2008-01-01T00:00+0200], dtype=object)

But I need it to be datetime64, as in this example (but including hours and minutes):

array(['2011-07-11', '2011-07-12', '2011-07-13', '2011-07-14',
       '2011-07-15', '2011-07-16', '2011-07-17'], dtype='datetime64[D]')

The CSV import seems to create an embedded object type for 'date' rather than the datetime64 dtype. Any ideas on how to fix this?
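To make the difference between the two dtypes concrete, here is a minimal sketch. It assumes a NumPy release with stable datetime64 support, and the names stamps, as_objects, and as_dt64 are illustrative only, not taken from the code above.

```python
import numpy as np

# Timestamps in ISO form, taken from the first CSV rows above
stamps = ['2007-01-01T00:30', '2007-01-01T01:00', '2007-01-01T01:30']

# datetime64 scalars stored in a generic object array (the unwanted result)
as_objects = np.array([np.datetime64(s, 'm') for s in stamps], dtype=object)

# The same timestamps stored natively with minute resolution (the desired result)
as_dt64 = np.array(stamps, dtype='datetime64[m]')

# An object array of datetime64 scalars can be cast after loading
converted = as_objects.astype('datetime64[m]')

print(as_objects.dtype)
print(as_dt64.dtype)
print(as_dt64[1] - as_dt64[0])
```

With the native dtype, arithmetic such as the difference in the last line yields a timedelta64 value instead of a Python object.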

1 answer:

Answer 0: (score: 1)

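I think the trick to avoiding the generic 'object' type is to avoid the recfromcsv function. Reading the data file manually and parsing the fields yourself yields the requested dtype='datetime64[m]':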

import numpy as np

# The string widths (U8, U16) are assumptions; a bare str dtype would
# create zero-length fields that cannot hold any text.
dt = np.dtype([ ('date',        '<M8[m]'),
                ('system',      '<f8'),
                ('pumping',     '<f8'),
                ('rgt',         '<f8'),
                ('agt',         '<f8'),
                ('sps',         '<f8'),
                ('eskom_import','<f8'),
                ('temperature', '<f8'),
                ('wind',        'U8'),
                ('pressure',    '<f8'),
                ('weather',     'U16') ])
numfields = len(dt.names)

def to_float(field):
    # Empty CSV fields become 0, matching the question's float(s or 0)
    return float(field or 0)

# Count the non-blank data rows (excluding the header) to preallocate the array
with open('data.csv') as fid:
    numlines = sum(1 for line in fid if line.strip()) - 1

data = np.zeros(numlines, dtype=dt)
with open('data.csv') as fid:
    fieldnames = fid.readline().strip().split(',')  # Header
    count = 0
    for line in fid:
        if not line.strip():
            continue  # Skip blank lines
        parsedline = line.strip().split(',')
        data['date'][count]         = np.datetime64(parsedline[0], 'm')
        data['system'][count]       = to_float(parsedline[1])
        data['pumping'][count]      = to_float(parsedline[2])
        data['rgt'][count]          = to_float(parsedline[3])
        data['agt'][count]          = to_float(parsedline[4])
        data['sps'][count]          = to_float(parsedline[5])
        data['eskom_import'][count] = to_float(parsedline[6])
        data['temperature'][count]  = to_float(parsedline[7])
        data['wind'][count]         = parsedline[8]
        data['pressure'][count]     = to_float(parsedline[9])
        data['weather'][count]      = parsedline[10]
        count += 1

>>> data['date']
array(['2007-01-01T00:30-0500', '2007-01-01T01:00-0500',
       '2007-01-01T00:30-0500', '2007-01-01T01:00-0500',
       '2007-01-01T00:30-0500', '2007-01-01T01:00-0500',
       '2007-01-01T00:30-0500', '2007-01-01T01:00-0500'], dtype='datetime64[m]')

You could improve this code by using "convertdict" and iterating over the parsed line, but the idea is the same.
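One way that improvement might look is sketched below. This is only an illustration under stated assumptions: the helper names to_minutes, to_float, and load_csv are hypothetical, the string widths U8 and U16 are guesses, and io.StringIO stands in for the real file using the three sample rows from the question.

```python
import io
import numpy as np

# Sample rows copied from the question; io.StringIO stands in for the real file.
csv_text = """date,system,pumping,rgt,agt,sps,eskom_import,temperature,wind,pressure,weather
2007-01-01 00:30,481.9,,,,,481.9,15,SW,1040,Fine
2007-01-01 01:00,471.9,,,,,471.9,15,SW,1040,Fine
2007-01-01 01:30,455.9,,,,,455.9,,,,
"""

def to_minutes(s):
    # Swap the space for the ISO 'T' separator before parsing to minute resolution.
    return np.datetime64(s.replace(' ', 'T'), 'm')

def to_float(s):
    # Empty fields become 0, mirroring the question's float(s or 0).
    return float(s or 0)

# One (name, dtype, converter) entry per column; U8 and U16 are assumed widths.
columns = [
    ('date',         'M8[m]', to_minutes),
    ('system',       'f8',    to_float),
    ('pumping',      'f8',    to_float),
    ('rgt',          'f8',    to_float),
    ('agt',          'f8',    to_float),
    ('sps',          'f8',    to_float),
    ('eskom_import', 'f8',    to_float),
    ('temperature',  'f8',    to_float),
    ('wind',         'U8',    str),
    ('pressure',     'f8',    to_float),
    ('weather',      'U16',   str),
]
dt = np.dtype([(name, kind) for name, kind, _ in columns])

def load_csv(fileobj):
    """Parse a CSV file object into a structured array with dtype dt."""
    next(fileobj)  # skip the header line
    rows = []
    for line in fileobj:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        fields = line.split(',')
        # Apply each column's converter to the matching field.
        rows.append(tuple(conv(f) for (_, _, conv), f in zip(columns, fields)))
    return np.array(rows, dtype=dt)

data = load_csv(io.StringIO(csv_text))
print(data['date'].dtype)
print(data['date'])
```

Collecting the rows in a list also removes the need to know the number of lines in advance; for a real file, pass the object returned by open('data.csv') to load_csv instead of the StringIO buffer.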