NumPy: with dtype=None I cannot slice columns, and with dtype=object I cannot set dtype.names

Time: 2013-07-19 06:01:25

Tags: python numpy

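I am running Python 2.6. In the example below I am trying to concatenate the date and time string columns from a CSV file. Depending on the dtype I set (None vs. object), I see differences in behavior that I cannot explain; see Questions 1 and 2 at the end of the post. The exceptions raised are not very descriptive, and the dtype documentation does not mention any specific behavior to expect when dtype is set to object.

Here is the snippet: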

#! /usr/bin/python

import numpy as np

# simulate a csv file
from StringIO import StringIO
data = StringIO("""
Title
Date,Time,Speed
,,(m/s)
2012-04-01,00:10, 85
2012-04-02,00:20, 86
2012-04-03,00:30, 87
""".strip())


# (Fail) case 1: dtype=None slicing a column fails

next(data)                                                      # eat away the title line
header = [item.strip() for item in next(data).split(',')]       # get the headers
arr1 = np.genfromtxt(data, dtype=None, delimiter=',',skiprows=1)# skiprows=1 for the row with units
arr1.dtype.names = header                                       # assign the header to names
                                                                # so we can do y=arr['Speed']
y1 = arr1['Speed']  

# Q1 IndexError: invalid index
#a1 = arr1[:,0] 
#print a1
# EDIT1: 
print "arr1.shape " 
print arr1.shape # (3,)

# Fails as expected TypeError: unsupported operand type(s) for +: 'numpy.ndarray' and 'numpy.ndarray'
# z1 = arr1['Date'] + arr1['Time'] 
# This can be workaround by specifying dtype=object, which leads to case 2

data.seek(0)        # resets

# (Fail) case 2: dtype=object assign header fails
next(data)                                                          # eat away the title line
header = [item.strip() for item in next(data).split(',')]           # get the headers
arr2 = np.genfromtxt(data, dtype=object, delimiter=',',skiprows=1)  # skiprows=1 for the row with units

# Q2 ValueError: there are no fields defined
#arr2.dtype.names = header # assign the header to names. so we can use it to do indexing
                         # ie y=arr2['Speed']
# y2 = arr2['Date'] + arr2['Time']    # column headings were assigned previously by arr2.dtype.names = header

data.seek(0)        # resets

# (Good) case 3: dtype=object but don't assign headers
next(data)                                                          # eat away the title line
header = [item.strip() for item in next(data).split(',')]           # get the headers
arr3 = np.genfromtxt(data, dtype=object, delimiter=',',skiprows=1)  # skiprows=1 for the row with units
y3 = arr3[:,0] + arr3[:,1]                                          # slice the columns
print y3

# case 4: dtype=None, all data are ints, array dimension 2-D

# simulate a csv file
from StringIO import StringIO
data2 = StringIO("""
Title
Date,Time,Speed
,,(m/s)
45,46,85
12,13,86
50,46,87
""".strip())

next(data2)                                                      # eat away the title line
header = [item.strip() for item in next(data2).split(',')]       # get the headers
arr4 = np.genfromtxt(data2, dtype=None, delimiter=',',skiprows=1)# skiprows=1 for the row with units
#arr4.dtype.names = header # Value error
print "arr4.shape " 
print arr4.shape # (3,3)

data2.seek(0)        # resets

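Question 1: At comment Q1, why can't I slice a column when dtype=None? This can be avoided by a) initializing arr1 = np.genfromtxt(...) with dtype=object, as in case 3, and b) commenting out arr1.dtype.names = ... to avoid the ValueError from case 2.

Question 2: At comment Q2, why can't I set dtype.names when dtype=object?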

EDIT1:

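Added case 4, which shows that when all values in the simulated CSV file are ints, the array is 2-D. The columns can be sliced, but assigning dtype.names still fails.

Changed the word "splice" to "slice".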

1 Answer:

Answer 0: (score: 2)

Question 1

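This is indexing, not "splicing", and you cannot index the columns of data for exactly the same reason I explained to you before in my answer to Question 7 here. Look at arr1.shape: it is (3,), i.e. arr1 is 1-D, not 2-D. There are no columns for you to index.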

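Now look at the shape of arr2: you will see that it is (3,3). Why is this? With dtype=None, np.genfromtxt infers a type for each column separately, and because the columns here have different types it returns a 1-D structured array with one record per row. If you do specify dtype=desired_type, np.genfromtxt treats every delimited part of the input string the same way (i.e. as desired_type), and it gives you back an ordinary, non-structured numpy array.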
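The difference between the two cases can be seen in a short sketch. It targets Python 3 and a recent NumPy rather than the Python 2.6 setup in the question, so it uses io.StringIO, skip_header instead of skiprows, and an explicit encoding argument. It also swaps in dtype=str for dtype=object, as an assumption that the same reasoning applies to any single uniform dtype:

```python
import io
import numpy as np

CSV_TEXT = """Title
Date,Time,Speed
,,(m/s)
2012-04-01,00:10, 85
2012-04-02,00:20, 86
2012-04-03,00:30, 87"""

header = ["Date", "Time", "Speed"]

# dtype=None: a type is inferred per column, so the result is a
# 1-D structured array with one record per row and named fields.
structured = np.genfromtxt(io.StringIO(CSV_TEXT), dtype=None, delimiter=",",
                           skip_header=3, names=header, encoding="utf-8")

# dtype=str (stand-in for dtype=object): every field is read as the same
# type, so the result is an ordinary 2-D array without named fields.
plain = np.genfromtxt(io.StringIO(CSV_TEXT), dtype=str, delimiter=",",
                      skip_header=3, encoding="utf-8")

print(structured.shape, structured.dtype.names)
print(plain.shape, plain.dtype.names)
```

Passing names=header to genfromtxt is one way to label the fields at load time; fields of the structured array are then selected by name (structured['Speed']), while columns of the plain array are selected by position (plain[:, 0]).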

I'm not quite sure what you are trying to do with this line:

z1 = arr1['Date'] + arr1['Time'] 

Did you mean to concatenate the date and time strings, like this: '2012-04-01 00:10'? You could do it like this:

z1 = [d + ' ' + t for d,t in zip(arr1['Date'],arr1['Time'])]

It depends on what you want to do with the output (this gives you a list of strings, not a numpy array).
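If you would rather end up with a numpy array of strings than a list, np.char.add does the concatenation element-wise. This is only a sketch with the date and time columns typed in by hand (dates and times are stand-ins for arr1['Date'] and arr1['Time']):

```python
import numpy as np

# Stand-ins for arr1['Date'] and arr1['Time'] from the question.
dates = np.array(["2012-04-01", "2012-04-02", "2012-04-03"])
times = np.array(["00:10", "00:20", "00:30"])

# Element-wise string concatenation: date + ' ' + time.
stamps = np.char.add(np.char.add(dates, " "), times)
print(stamps)
```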

I should point out that, as of version 1.7, Numpy has core array types that support datetime functionality. This lets you do much more useful things, such as computing time deltas.

dts = np.array(z1,dtype=np.datetime64)
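As a rough illustration of the time-delta point, here is a sketch with the strings typed in by hand. It joins date and time with "T" (the ISO 8601 separator) and asks for minute resolution explicitly; both are choices made for this example, not requirements:

```python
import numpy as np

# Stand-ins for arr1['Date'] and arr1['Time'] from the question.
dates = ["2012-04-01", "2012-04-02", "2012-04-03"]
times = ["00:10", "00:20", "00:30"]

# Build ISO 8601 strings and parse them at minute resolution.
stamps = [d + "T" + t for d, t in zip(dates, times)]
dts = np.array(stamps, dtype="datetime64[m]")

# Differences between consecutive timestamps are timedelta64 values.
deltas = np.diff(dts)
print(deltas)
```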

Edit: If you want to plot timeseries data, you can use matplotlib.dates.strpdate2num to convert the strings to matplotlib datenums, then use plot_date():

from matplotlib import dates
from matplotlib import pyplot as pp

# convert date and time strings to matplotlib datenums
dtconv = dates.strpdate2num('%Y-%m-%d%H:%M')
datenums = [dtconv(d+t) for d,t in zip(arr1['Date'],arr1['Time'])]

# use plot_date to plot timeseries
pp.plot_date(datenums,arr1['Speed'],'-ob')
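As far as I know, strpdate2num is no longer available in recent Matplotlib releases. A version-independent alternative is to parse the strings with the standard library's datetime.strptime; Matplotlib's plotting functions generally accept datetime objects directly. The sketch below only does the parsing, with hand-typed stand-ins for the two columns:

```python
from datetime import datetime

# Stand-ins for arr1['Date'] and arr1['Time'] from the question.
dates = ["2012-04-01", "2012-04-02", "2012-04-03"]
times = ["00:10", "00:20", "00:30"]

# Parse each "date time" pair into a datetime object.
when = [datetime.strptime(d + " " + t, "%Y-%m-%d %H:%M")
        for d, t in zip(dates, times)]
print(when[0])

# The list `when` could then be used as the x values of a plot.
```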

You should also take a look at Pandas, which has some nice tools for visualising timeseries data.

Question 2

You can't set the names of arr2 because it is not a structured array (see above).
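If you already have a plain 2-D array and still want to address columns by name, one option is to build a record array from its columns. A minimal sketch, assuming a hand-typed string array in place of the one genfromtxt returns (np.rec.fromarrays is my suggestion here, not something from the question):

```python
import numpy as np

header = ["Date", "Time", "Speed"]

# A plain 2-D array of strings, like the one read with a single dtype.
plain = np.array([["2012-04-01", "00:10", "85"],
                  ["2012-04-02", "00:20", "86"],
                  ["2012-04-03", "00:30", "87"]])
print(plain.dtype.names)  # None: there are no fields to rename

# Transposing yields one row per original column; each becomes a named field.
rec = np.rec.fromarrays(plain.T, names=header)
print(rec["Date"])
```

Note that every field stays a string here, including Speed, so a numeric column would still need converting (for example with astype).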