numpy UnicodeDecodeError: am I using genfromtxt the right way?

Time: 2016-03-22 21:18:18

Tags: numpy

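I am stuck. I want to read a simple CSV file into a NumPy array and seem to have dug myself into a hole. I am new to NumPy and I am sure I have messed this up, since I can normally read CSV files in Python 3.4 without trouble. I don't want to use pandas, so I thought I would build my skills with NumPy, but I am really not getting this at all. If someone could tell me whether I am on the right track with genfromtxt, or whether there is an easier way, and give me a push in the right direction, I would appreciate it.

I want to read in the CSV file, reduce the datetime column to a date such as 2014-08-04, and then put it in a NumPy array together with the rest of the columns. Below is what I have so far and the error I get when running it. I can get the date in there, but I don't know how to add date.strftime("%Y-%m-%d") to datefunc. I also don't see how to format the string for SYM to resolve the error. Any help would be appreciated.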

Data

 2015-08-04 02:14:05.249392, AA, 0.0193103612, 0.0193515212, 0.0249713335, 30.6542480634, 30.7195875454, 39.640763021, 0.2131498442, 29.0406746589, 13524.5347810182, 89, 57, 99 
 2015-08-04 02:14:05.325113, AAPL, 0.0170506271, 0.0137941891, 0.0105915637, 27.0670313481, 21.8975963326, 16.8135861893, -19.0986405157, -23.2172064279, 21.5647072302, 33, 26, 75 
 2015-08-04 02:14:05.415193, AIG, 0.0080808151, 0.0073296055, 0.0076213535, 12.8278962785, 11.635388035, 12.0985236788, -9.2962105215, 3.980405659, -142.8175077335, 71, 42, 33 
 2015-08-04 02:14:05.486185, AMZN, 0.0235649449, 0.0305828226, 0.0092703502, 37.4081902773, 48.5487257749, 14.7162247572, 29.7810062852, -69.6877219282, -334.0005615016, 2, 92, 10 

Code (sorry, still learning)

import numpy as np

from datetime import datetime
from datetime import date,time


datefunc = lambda x: datetime.strptime(x.decode("utf-8"), '%Y-%m-%d %H:%M:%S.%f')
a = np.genfromtxt('/home/dave/Desktop/development/hvanal2016.csv',delimiter = ',',
converters = {0:datefunc},dtype='object,str,float,float,float,float,float,float,float,float,float,float,float,float',
names = ["date","sym","20sd","10sd","5sd","hv20","hv10","hv5","2010hv","105hv","abshv","2010rank","105rank","absrank"])

print(a["date"])
print(a["sym"])
print(a["20sd"])
print(a["hv20"])
print(a["absrank"])

Error

Python 3.4.3+ (default, Oct 14 2015, 16:03:50) 
[GCC 5.2.1 20151010] on linux
Type "copyright", "credits" or "license()" for more information.
>>> 
============================================================================== RESTART: /home/dave/3 9 15 my slope.py ===============================================================================
[datetime.datetime(2015, 8, 4, 2, 14, 5, 249392)
 datetime.datetime(2015, 8, 4, 2, 14, 5, 325113)
 datetime.datetime(2015, 8, 4, 2, 14, 5, 415193) ...,
 datetime.datetime(2016, 3, 18, 1, 0, 25, 925754)
 datetime.datetime(2016, 3, 18, 1, 0, 26, 26400)
 datetime.datetime(2016, 3, 18, 1, 0, 26, 114828)]
Traceback (most recent call last):
  File "/home/dave/3 9 15 my slope.py", line 19, in <module>
    print(a["sym"])
  File "/usr/lib/python3/dist-packages/numpy/core/numeric.py", line 1615, in array_str
    return array2string(a, max_line_width, precision, suppress_small, ' ', "", str)
  File "/usr/lib/python3/dist-packages/numpy/core/arrayprint.py", line 454, in array2string
    separator, prefix, formatter=formatter)
  File "/usr/lib/python3/dist-packages/numpy/core/arrayprint.py", line 328, in _array2string
    _summaryEdgeItems, summary_insert)[:-1]
  File "/usr/lib/python3/dist-packages/numpy/core/arrayprint.py", line 490, in _formatArray
    word = format_function(a[i]) + separator
UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000) 

1 answer:

Answer 0 (score: 1)

So part of your text is

b'2015-08-04 02:14:05.249392 AA 0.0193103612 ...'

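(I am using b because in Py3 genfromtxt opens the file as byte strings.)

But you specified the ',' delimiter. I don't see any commas.

Let's try a basic load first, without the fancy business.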

In [97]: txt=b"""2015-08-04 02:14:05.249392 AA 0.0193103612 0.0193515212 0.0249713335 30.6542480634 30.7195875454 39.640763021 0.2131498442 29.0406746589 13524.5347810182 89 57 99 
 2015-08-04 02:14:05.325113 AAPL 0.0170506271 0.0137941891 0.0105915637 27.0670313481 21.8975963326 16.8135861893 -19.0986405157 -23.2172064279 21.5647072302 33 26 75 
 """
In [98]: txt=txt.splitlines()
In [99]: data=np.genfromtxt(txt,dtype=None)
In [100]: data
Out[100]: 
array([ (b'2015-08-04', b'02:14:05.249392', b'AA', 0.0193103612, 0.0193515212, 0.0249713335, 30.6542480634, 30.7195875454, 39.640763021, 0.2131498442, 29.0406746589, 13524.5347810182, 89, 57, 99),
       (b'2015-08-04', b'02:14:05.325113', b'AAPL', 0.0170506271, 0.0137941891, 0.0105915637, 27.0670313481, 21.8975963326, 16.8135861893, -19.0986405157, -23.2172064279, 21.5647072302, 33, 26, 75)], 
      dtype=[('f0', 'S10'), ('f1', 'S15'), ('f2', 'S4'), ('f3', '<f8'), ('f4', '<f8'), ('f5', '<f8'), ('f6', '<f8'), ('f7', '<f8'), ('f8', '<f8'), ('f9', '<f8'), ('f10', '<f8'), ('f11', '<f8'), ('f12', '<i4'), ('f13', '<i4'), ('f14', '<i4')])

The datetime information ends up in 2 fields:

In [101]: data[['f0','f1']]
Out[101]: 
array([(b'2015-08-04', b'02:14:05.249392'),
       (b'2015-08-04', b'02:14:05.325113')], 
      dtype=[('f0', 'S10'), ('f1', 'S15')])

Your date function works on the full byte string:

In [102]: datefunc(b'2015-08-04 02:14:05.249392')
Out[102]: datetime.datetime(2015, 8, 4, 2, 14, 5, 249392)

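But it needs both fields (as split by the whitespace delimiter). So we need to figure out a way to parse these two substrings as one, rather than as two separate fields.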
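One possible way to do that, sketched here rather than taken from the original answer, is to load the date and time as two string fields and join them afterwards. The sample rows are the two from the session above, shortened to five whitespace-separated fields; the "U" (unicode) field widths and the explicit encoding argument are assumptions that require NumPy 1.14 or later.

```python
from datetime import datetime

import numpy as np

# The two sample rows from the session above, shortened to five fields.
lines = [
    "2015-08-04 02:14:05.249392 AA 0.0193103612 0.0193515212",
    "2015-08-04 02:14:05.325113 AAPL 0.0170506271 0.0137941891",
]

# Whitespace is the default delimiter, so date and time land in f0 and f1.
# encoding="utf-8" (NumPy >= 1.14) yields str fields instead of bytes.
data = np.genfromtxt(lines, dtype="U10,U15,U5,f8,f8", encoding="utf-8")

# Rejoin the two string fields and parse them as one timestamp.
stamps = [
    datetime.strptime(f"{d} {t}", "%Y-%m-%d %H:%M:%S.%f")
    for d, t in zip(data["f0"], data["f1"])
]
print(stamps)
```

The result is a plain Python list of datetime objects; it could be stored back into an object-dtype field if everything needs to live in one array.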

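Maybe I'll change the sample txt so that it really uses the ',' delimiter (but not between the date and the time) and set up something that works.

With the ','-delimited text I get: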

In [117]: data=np.genfromtxt(txt,delimiter=',',dtype=None,usecols=[0,1,2,3])
In [118]: data.dtype
Out[118]: dtype([('f0', 'S26'), ('f1', 'S5'), ('f2', '<f8'), ('f3', '<f8')])
In [119]: data['f0']
Out[119]: 
array([b'2015-08-04 02:14:05.249392', b'2015-08-04 02:14:05.325113',
       b'2015-08-04 02:14:05.415193', b'2015-08-04 02:14:05.486185'], 
      dtype='|S26')
In [120]: [datefunc(d) for d in data['f0']]
Out[120]: 
[datetime.datetime(2015, 8, 4, 2, 14, 5, 249392),
 datetime.datetime(2015, 8, 4, 2, 14, 5, 325113),
 datetime.datetime(2015, 8, 4, 2, 14, 5, 415193),
 datetime.datetime(2015, 8, 4, 2, 14, 5, 486185)]

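I used usecols because the full text has 14 fields in line 1 and 13 fields in the other lines.

If I specify the dtype (instead of the easy None), I can replace the strings in the 1st field with these datetime objects: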

In [122]: data=np.genfromtxt(txt,delimiter=',',dtype='O,S5,f,f',usecols=[0,1,2,3])
In [123]: data
Out[123]: 
array([ (b'2015-08-04 02:14:05.249392', b' AA', 0.01931036077439785, 0.019351521506905556),
       (b'2015-08-04 02:14:05.325113', b' AAPL', 0.01705062761902809, 0.01379418931901455),....], 
      dtype=[('f0', 'O'), ('f1', 'S5'), ('f2', '<f4'), ('f3', '<f4')])
In [124]: data['f0']
Out[124]: 
array([b'2015-08-04 02:14:05.249392', b'2015-08-04 02:14:05.325113',
       b'2015-08-04 02:14:05.415193', b'2015-08-04 02:14:05.486185'], dtype=object)
....
In [126]: data['f0']=[datefunc(d) for d in data['f0']]
In [127]: data
Out[127]: 
array([ (datetime.datetime(2015, 8, 4, 2, 14, 5, 249392), b' AA', 0.01931036077439785, 0.019351521506905556),
       (datetime.datetime(2015, 8, 4, 2, 14, 5, 325113), b' AAPL', 0.01705062761902809, 0.01379418931901455),...], 
      dtype=[('f0', 'O'), ('f1', 'S5'), ('f2', '<f4'), ('f3', '<f4')])

With the converter, your call works (more or less):

In [133]: data=np.genfromtxt(txt,dtype='object,S5,float,float',
   converters = {0:datefunc},delimiter=',',usecols=[0,1,2,3])
In [134]: data
Out[134]: 
array([ (datetime.datetime(2015, 8, 4, 2, 14, 5, 249392), b' AA', 0.0193103612, 0.0193515212),
       (datetime.datetime(2015, 8, 4, 2, 14, 5, 325113), b' AAPL', 0.0170506271, 0.0137941891),...], 
      dtype=[('f0', 'O'), ('f1', 'S5'), ('f2', '<f8'), ('f3', '<f8')])
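The question also asked how to get date.strftime("%Y-%m-%d") into the converter. A minimal sketch of one way to do that follows. It uses two full rows from the question's data and the question's column names; the helper name to_day, the "U10"/"U5" string widths (fixed-width fields, where the question used str), and the autostrip and encoding arguments are my assumptions, and encoding requires NumPy 1.14 or later.

```python
import io
from datetime import datetime

import numpy as np

# Two full rows copied from the question's data (14 comma-separated fields).
csv_text = (
    "2015-08-04 02:14:05.249392, AA, 0.0193103612, 0.0193515212, 0.0249713335, "
    "30.6542480634, 30.7195875454, 39.640763021, 0.2131498442, 29.0406746589, "
    "13524.5347810182, 89, 57, 99\n"
    "2015-08-04 02:14:05.325113, AAPL, 0.0170506271, 0.0137941891, 0.0105915637, "
    "27.0670313481, 21.8975963326, 16.8135861893, -19.0986405157, -23.2172064279, "
    "21.5647072302, 33, 26, 75\n"
)


def to_day(text):
    """Parse the full timestamp and return only the date as 'YYYY-MM-DD'."""
    if isinstance(text, bytes):  # converters may receive bytes on older setups
        text = text.decode("utf-8")
    stamp = datetime.strptime(text.strip(), "%Y-%m-%d %H:%M:%S.%f")
    return stamp.strftime("%Y-%m-%d")


names = ["date", "sym", "20sd", "10sd", "5sd", "hv20", "hv10", "hv5",
         "2010hv", "105hv", "abshv", "2010rank", "105rank", "absrank"]

data = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    autostrip=True,                  # drop the blanks around each field
    encoding="utf-8",                # NumPy >= 1.14: fields arrive as str
    converters={0: to_day},
    dtype="U10,U5" + ",f8" * 12,     # fixed-width strings, then 12 floats
    names=names,
)

print(data["date"])
print(data["sym"])
```

Storing the date as a 10-character string keeps the array free of Python objects; if real date arithmetic is needed, a datetime64 field is probably the better fit.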

NumPy's datetime64 also accepts this string. These types work with NumPy's numeric operations.

In [154]: datefunc(b'2015-08-04 02:14:05.249392')
Out[154]: datetime.datetime(2015, 8, 4, 2, 14, 5, 249392)
In [155]: np.datetime64(b'2015-08-04 02:14:05.249392')
Out[155]: numpy.datetime64('2015-08-04T02:14:05.249392-0700')

Following "Importing csv into Numpy datetime64", I got this working:

In [175]: data=np.genfromtxt(txt,dtype='M8[us],S5,float,float',
   delimiter=',',usecols=[0,1,2,3])
In [176]: data
Out[176]: 
array([ (datetime.datetime(2015, 8, 4, 9, 14, 5, 249392), b' AA', 0.0193103612, 0.0193515212),
       (datetime.datetime(2015, 8, 4, 9, 14, 5, 325113), b' AAPL', 0.0170506271, 0.0137941891),...], 
      dtype=[('f0', '<M8[us]'), ('f1', 'S5'), ('f2', '<f8'), ('f3', '<f8')])

See the datetime units: http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html#datetime-units
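To illustrate how the units matter for the question's goal of keeping only the date, here is a small sketch that is not part of the original answer. It parses the two sample timestamps as microsecond-resolution datetime64 values and then casts them to day resolution. As far as I know, NumPy 1.11 and later treat datetime64 as timezone-naive, so the 7-hour shift visible in the older output above should not appear.

```python
import numpy as np

# The two sample timestamps, parsed at microsecond resolution.
stamps = np.array(
    ["2015-08-04 02:14:05.249392", "2015-08-04 02:14:05.325113"],
    dtype="M8[us]",
)

# Casting to the day unit truncates the time-of-day part.
days = stamps.astype("M8[D]")

# datetime64 values support arithmetic; the difference is a timedelta64.
delta = stamps[1] - stamps[0]

print(days)
print(delta)
```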