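I have a tab-delimited dataset that, in terms of data types, looks like the sample below when loaded into Excel, except that its dimensions are 83 x 23275. As you can see, the dataset is of mixed type: row 0 and column 0 hold strings, and everything else is numeric.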
"A" "B" "C" "D"
"2000-01-01" 0.469112 -0.282863 -1.509059 -1.135632
"2000-01-02" 1.212112 -0.173215 0.119209 -1.044236
"2000-01-03" -0.861849 -2.104569 -0.494929 1.071804
"2000-01-04" 0.721555 -0.706771 -1.039575 0.271860
"2000-01-05" -0.424972 0.567020 0.276232 -1.087401
"2000-01-06" -0.673690 0.113648 -1.478427 0.524988
"2000-01-07" 0.404705 0.577046 -1.715002 -1.039268
"2000-01-08" -0.370647 -1.157892 -1.344312 0.844885
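Using pandas or numpy, how can I load this kind of data and access the numbers together with their correct labels? I would even be satisfied with two label vectors of string dtype (row and column labels, of lengths 83 and 23275) plus a float64 matrix (the floating-point data is 82 x 23274).
I have loaded the file into numpy and into pandas separately, but I have not managed to access any of my data.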
import numpy as np
import pandas as pd
#numpy
path = "C:/nature13173-s4.txt"
e18 = np.genfromtxt(path, delimiter = '\t')
print(e18.shape)
#pandas
df=pd.read_csv(path, sep='\t',header=None)
Answer 0 (score: 3)
I pasted your sample into a file. Pasting does not preserve the tabs, so I used the default whitespace delimiter.
My first attempt:
e18=np.genfromtxt('stack33662863.txt',names=True)
gave me this error:
ValueError: Some errors were detected !
Line #2 (got 5 columns instead of 4)
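That is because the first line has 4 column headers, but the date column has none. So I have to skip the header line and supply my own field names: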
In [624]: e18=np.genfromtxt('stack33662863.txt',names=['date','A','B','C','D'],skip_header=1)
In [625]: e18
Out[625]:
array([(nan, 0.469112, -0.282863, -1.509059, -1.135632),
(nan, 1.212112, -0.173215, 0.119209, -1.044236),
(nan, -0.861849, -2.104569, -0.494929, 1.071804),
(nan, 0.721555, -0.706771, -1.039575, 0.27186),
(nan, -0.424972, 0.56702, 0.276232, -1.087401),
(nan, -0.67369, 0.113648, -1.478427, 0.524988),
(nan, 0.404705, 0.577046, -1.715002, -1.039268),
(nan, -0.370647, -1.157892, -1.344312, 0.844885)],
dtype=[('date', '<f8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8'), ('D', '<f8')])
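Almost, except that the 'date' column is all nan. Let's use dtype=None to tell genfromtxt to infer the dtype of each column instead of assuming that everything is a float. An alternative is to supply a dtype for each column explicitly.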
In [626]: e18=np.genfromtxt('stack33662863.txt',names=['date','A','B','C','D'],skip_header=1,dtype=None)
In [627]: e18
Out[627]:
array([(b'"2000-01-01"', 0.469112, -0.282863, -1.509059, -1.135632),
(b'"2000-01-02"', 1.212112, -0.173215, 0.119209, -1.044236),
(b'"2000-01-03"', -0.861849, -2.104569, -0.494929, 1.071804),
(b'"2000-01-04"', 0.721555, -0.706771, -1.039575, 0.27186),
(b'"2000-01-05"', -0.424972, 0.56702, 0.276232, -1.087401),
(b'"2000-01-06"', -0.67369, 0.113648, -1.478427, 0.524988),
(b'"2000-01-07"', 0.404705, 0.577046, -1.715002, -1.039268),
(b'"2000-01-08"', -0.370647, -1.157892, -1.344312, 0.844885)],
dtype=[('date', 'S12'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8'), ('D', '<f8')])
That looks pretty good. The data is all there.
I can access its attributes and values like this:
In [628]: e18.shape
Out[628]: (8,)
In [629]: e18.dtype
Out[629]: dtype([('date', 'S12'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8'), ('D', '<f8')])
In [630]: e18['date']
Out[630]:
array([b'"2000-01-01"', b'"2000-01-02"', b'"2000-01-03"', b'"2000-01-04"',
b'"2000-01-05"', b'"2000-01-06"', b'"2000-01-07"', b'"2000-01-08"'],
dtype='|S12')
In [631]: e18['A']
Out[631]:
array([ 0.469112, 1.212112, -0.861849, 0.721555, -0.424972, -0.67369 ,
0.404705, -0.370647])
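The date field above still carries the literal double quotes from the file. A minimal sketch of cleaning it up follows. It assumes a NumPy version that supports the `encoding` argument of `genfromtxt`, so the field is read as text rather than bytes. The `sample` string is a hypothetical in-memory stand-in for the file, and the conversion to `datetime64` is an optional extra that the original answer does not use.

```python
import io
import numpy as np

# Hypothetical in-memory stand-in for the whitespace-delimited sample file.
sample = '''"A" "B" "C" "D"
"2000-01-01" 0.469112 -0.282863 -1.509059 -1.135632
"2000-01-02" 1.212112 -0.173215 0.119209 -1.044236
'''

# Same call as above, plus encoding='utf-8' so strings come back as str, not bytes.
e18 = np.genfromtxt(io.StringIO(sample),
                    names=['date', 'A', 'B', 'C', 'D'],
                    skip_header=1, dtype=None, encoding='utf-8')

# Strip the surrounding double quotes from every label.
date_labels = np.char.strip(e18['date'], '"')

# Optional: turn the ISO-formatted labels into real dates.
dates = date_labels.astype('datetime64[D]')

print(date_labels)
print(dates)
```

Keeping the labels as plain strings is enough if they are only used for lookup. The `datetime64` step only makes sense when the row labels really are dates, which may not be true of the real dataset.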
Another option is to load the file without names:
In [636]: e18=np.genfromtxt('stack33662863.txt',skip_header=1)
In [637]: e18.shape
Out[637]: (8, 5)
In [638]: e18[:3,:]
Out[638]:
array([[ nan, 0.469112, -0.282863, -1.509059, -1.135632],
[ nan, 1.212112, -0.173215, 0.119209, -1.044236],
[ nan, -0.861849, -2.104569, -0.494929, 1.071804]])
Now it is all floats in a 2D array, but with nan in the first (date) column. We can slice that column off to get a clean 2D array:
In [639]: e18[:,1:]
Out[639]:
array([[ 0.469112, -0.282863, -1.509059, -1.135632],
[ 1.212112, -0.173215, 0.119209, -1.044236],
[-0.861849, -2.104569, -0.494929, 1.071804],
[ 0.721555, -0.706771, -1.039575, 0.27186 ],
[-0.424972, 0.56702 , 0.276232, -1.087401],
[-0.67369 , 0.113648, -1.478427, 0.524988],
[ 0.404705, 0.577046, -1.715002, -1.039268],
[-0.370647, -1.157892, -1.344312, 0.844885]])
I can get the same array with usecols. With many more columns in the real data this may be less convenient (but feel free to try it):
e18=np.genfromtxt('stack33662863.txt',skip_header=1,usecols=range(1,5))
You can load the dates separately:
In [647]: np.genfromtxt('stack33662863.txt',skip_header=1,usecols=0,dtype=None)
Out[647]:
array([b'"2000-01-01"', b'"2000-01-02"', b'"2000-01-03"', b'"2000-01-04"',
b'"2000-01-05"', b'"2000-01-06"', b'"2000-01-07"', b'"2000-01-08"'],
dtype='|S12')
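Putting the pieces above together gives the three objects the question asks for: a vector of column labels, a vector of row labels, and a float matrix. The following is only a sketch under stated assumptions: the file is tab-delimited, its header line has one entry fewer than the data lines (as in the sample), and `sample` is a hypothetical in-memory stand-in for the real file. Counting the header entries avoids hard-coding the upper bound of `usecols`.

```python
import io
import numpy as np

# Hypothetical stand-in for the tab-delimited file; a real file would be opened by path.
sample = ('"A"\t"B"\t"C"\t"D"\n'
          '"2000-01-01"\t0.469112\t-0.282863\t-1.509059\t-1.135632\n'
          '"2000-01-02"\t1.212112\t-0.173215\t0.119209\t-1.044236\n')

# Column labels: split the header line by hand and drop the quotes.
header = sample.splitlines()[0]
col_labels = np.array([h.strip('"') for h in header.split('\t')])
n_cols = len(col_labels)

# Numeric block: every column except column 0, which holds the row labels.
data = np.genfromtxt(io.StringIO(sample), delimiter='\t', skip_header=1,
                     usecols=range(1, n_cols + 1))

# Row labels: column 0 only, read as text, then stripped of quotes.
row_labels = np.genfromtxt(io.StringIO(sample), delimiter='\t', skip_header=1,
                           usecols=0, dtype=None, encoding='utf-8')
row_labels = np.char.strip(row_labels, '"')

print(col_labels, row_labels, data.shape)
```

If the real header does contain an entry for the label column, `n_cols` would have to be reduced by one and the first header entry dropped.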
Yet another option is to define a dtype that groups all the numeric columns into a single field:
In [654]: dt=np.dtype([('date','S12'),('data','float',(4,))])
In [655]: e18=np.genfromtxt('stack33662863.txt',skip_header=1,dtype=dt)
In [656]: e18['date']
Out[656]:
array([b'"2000-01-01"', b'"2000-01-02"', b'"2000-01-03"', b'"2000-01-04"',
b'"2000-01-05"', b'"2000-01-06"', b'"2000-01-07"', b'"2000-01-08"'],
dtype='|S12')
Now you can retrieve the numeric part as a 2D array:
In [658]: e18['data']
Out[658]:
array([[ 0.469112, -0.282863, -1.509059, -1.135632],
[ 1.212112, -0.173215, 0.119209, -1.044236],
[-0.861849, -2.104569, -0.494929, 1.071804],
[ 0.721555, -0.706771, -1.039575, 0.27186 ],
[-0.424972, 0.56702 , 0.276232, -1.087401],
[-0.67369 , 0.113648, -1.478427, 0.524988],
[ 0.404705, 0.577046, -1.715002, -1.039268],
[-0.370647, -1.157892, -1.344312, 0.844885]])
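Once the labels and the 2D array are held separately, a number can be looked up by its row and column label with a pair of plain dictionaries. This is a minimal sketch, not part of the original answer: `row_pos`, `col_pos` and `lookup` are hypothetical helper names, and the values are copied by hand from the first three rows of the sample.

```python
import numpy as np

# Hypothetical labels and values, copied from the first rows of the sample.
row_labels = ['2000-01-01', '2000-01-02', '2000-01-03']
col_labels = ['A', 'B', 'C', 'D']
data = np.array([[ 0.469112, -0.282863, -1.509059, -1.135632],
                 [ 1.212112, -0.173215,  0.119209, -1.044236],
                 [-0.861849, -2.104569, -0.494929,  1.071804]])

# Map each label to its position along the matching axis.
row_pos = {label: i for i, label in enumerate(row_labels)}
col_pos = {label: j for j, label in enumerate(col_labels)}

def lookup(row, col):
    """Return the value stored at the given row label and column label."""
    return data[row_pos[row], col_pos[col]]

value = lookup('2000-01-02', 'C')
print(value)
```

This relies on the labels being unique along each axis. With duplicate labels the dictionaries would keep only the last position for each one.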
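Answer 1 (score: 3)
Actually this works fine for me in pandas when I use \s+ as the separator instead of tab. It seems that the tab character is not the separator in your case.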
In [10]:
pd.read_csv('C:/nature13173-s4.txt', sep=r'\s+')  # raw string avoids an invalid-escape warning for \s
Out[10]:
A B C D
2000-01-01 0.469112 -0.282863 -1.509059 -1.135632
2000-01-02 1.212112 -0.173215 0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929 1.071804
2000-01-04 0.721555 -0.706771 -1.039575 0.271860
2000-01-05 -0.424972 0.567020 0.276232 -1.087401
2000-01-06 -0.673690 0.113648 -1.478427 0.524988
2000-01-07 0.404705 0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312 0.844885
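With the DataFrame loaded this way, the labels and numbers asked for in the question are available directly. The sketch below assumes, as in the sample, that the header has one entry fewer than the data rows, in which case pandas uses the first column as the index. `sample` is a hypothetical in-memory stand-in for the file.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the whitespace-delimited sample file.
sample = '''"A" "B" "C" "D"
"2000-01-01" 0.469112 -0.282863 -1.509059 -1.135632
"2000-01-02" 1.212112 -0.173215 0.119209 -1.044236
"2000-01-03" -0.861849 -2.104569 -0.494929 1.071804
'''

df = pd.read_csv(io.StringIO(sample), sep=r'\s+')

row_labels = df.index.tolist()    # labels taken from the first column
col_labels = df.columns.tolist()  # labels taken from the header line
values = df.to_numpy()            # float64 matrix without any labels

# Label-based access to a single number.
single = df.loc['2000-01-02', 'B']

print(row_labels, col_labels, values.shape, single)
```

For a file that really is tab-delimited, `sep='\t'` should work the same way. If the header also has an entry for the label column, passing `index_col=0` is one way to keep the labels as the index.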