The dtype parameter in numpy genfromtxt

Asked: 2014-02-02 23:56:37

Tags: python arrays numpy genfromtxt

I am trying to create an Mx2 numpy matrix or array from the following file contents:

shell: head WORLD#America.csv
"2013-04-17 12","3","WORLD","#America"
"2013-04-17 13","9","WORLD","#America"
"2013-04-17 14","4","WORLD","#America"
"2013-04-17 15","3","WORLD","#America"
"2013-04-17 16","7","WORLD","#America"
"2013-04-17 17","8","WORLD","#America"
"2013-04-17 18","6","WORLD","#America"
"2013-04-17 19","6","WORLD","#America"
"2013-04-17 20","6","WORLD","#America"
"2013-04-17 21","2","WORLD","#America"

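I came across the genfromtxt() function, but have had no success extracting the data with it. With a file named f, I first tried ts = genfromtxt(f, delimiter=","), which gave me an array filled with nan. That was only a first attempt, so I read the documentation for the dtype parameter, which specifies the data type of the resulting array. It seemed that to get an Mx2 matrix with entries of the form (datetime, int), I should pass dtype=[('f1', datetime64), ('f2', uint)]. When I do that, the following is assigned to the variable ts: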

(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L),
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 18446744073709551615L)],
dtype=[('f1', ('<M8[us]', {})), ('f2', '<u8')])

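Every value I get in the matrix is the same constant... why is nothing being read from my file? This is clearly not the output I should be getting.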

How can I get the desired Mx2 matrix or array, with a datetime in the first column and an integer in the second, as shown by the head command?

1 answer:

Answer 0 (score: 0)

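As pointed out in the comments, one difficulty in reading this file with genfromtxt is the presence of the quote characters. It might be best to just strip the quotes (programmatically), but you can also cheat your way around the problem by specifying the quote character as the delimiter: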

np.genfromtxt(filename, delimiter='"', dtype=str, comments=None)[0]
# array(['', '2013-04-17 12', ',', '3', ',', 'WORLD', ',', '#America', ''], 
#       dtype='|S13')

The file is now interpreted as having 9 columns, of which the second and fourth (indices 1 and 3) contain the data of interest.
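To make the effect of the quotes concrete, here is a minimal sketch that runs both parses on an in-memory copy of the first few rows. The `sample` string and the variable names are assumptions for illustration, and `io.StringIO` stands in for the real file.

```python
import io

import numpy as np

# Hypothetical in-memory stand-in for the first rows of WORLD#America.csv.
sample = (
    '"2013-04-17 12","3","WORLD","#America"\n'
    '"2013-04-17 13","9","WORLD","#America"\n'
)

# Default float parsing: every field still carries its quote characters,
# so none of them converts to a float and each one becomes nan.
as_float = np.genfromtxt(io.StringIO(sample), delimiter=",")

# Splitting on the quote character instead; comments=None keeps '#America'
# from being treated as the start of a comment.
as_text = np.genfromtxt(
    io.StringIO(sample), delimiter='"', dtype=str, comments=None
)

print(as_float)
print(as_text[0])
```

The second call should reproduce the 9-column layout shown above, with the timestamp at index 1 and the count at index 3.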

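The other problem is specifying the dtype for the datetime column. In recent (?) versions of Numpy you have to specify the time/date unit, or genfromtxt raises an error. In this case you apparently need to use M8[h] as the dtype, which specifies hourly units.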
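As a small illustration of what the unit in M8[h] means, the sketch below builds an hourly datetime64 array by hand, independently of genfromtxt. The variable names are illustrative only.

```python
import numpy as np

# 'M8[h]' is shorthand for datetime64 with an hourly unit.
hourly = np.dtype("M8[h]")

# np.datetime_data reports the unit and step count stored in the dtype.
unit, count = np.datetime_data(hourly)

# Hour-resolution timestamps like those in the file (ISO form with 'T').
stamps = np.array(["2013-04-17T12", "2013-04-17T13"], dtype=hourly)

# Arithmetic happens in whole hours.
next_hour = stamps[0] + np.timedelta64(1, "h")

print(unit, count, stamps, next_hour)
```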

Putting it all together, I was able to load the file with:

ts = np.genfromtxt(filename, 
                   delimiter='"', 
                   dtype='M8[h], uint', 
                   usecols=[1,3])

Alternatively, you could look at using a converter, or try the CSV reader from Pandas.
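For the "strip the quotes programmatically" route mentioned at the start of this answer, one possible sketch uses the standard-library csv module to handle the quoting and then builds the structured array by hand. This is not the method used above, just an alternative under stated assumptions: the `sample` string stands in for the file, and the field names 'f1' and 'f2' are borrowed from the question.

```python
import csv
import io

import numpy as np

# Hypothetical in-memory stand-in for the file; with a real file you would
# pass an open file object to csv.reader instead.
sample = (
    '"2013-04-17 12","3","WORLD","#America"\n'
    '"2013-04-17 13","9","WORLD","#America"\n'
    '"2013-04-17 14","4","WORLD","#America"\n'
)

# csv.reader removes the surrounding quotes from each field.
rows = list(csv.reader(io.StringIO(sample)))

# Keep only the first two fields: the hourly timestamp and the count.
ts = np.array(
    [(np.datetime64(r[0].replace(" ", "T"), "h"), int(r[1])) for r in rows],
    dtype=[("f1", "M8[h]"), ("f2", "u8")],
)

print(ts["f1"])
print(ts["f2"])
```

The result is a one-dimensional structured array with M records rather than a true Mx2 matrix, since the two columns have different types; the columns are accessed by name as ts['f1'] and ts['f2'].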