How do I load a tab-delimited file with mixed data types into numpy or pandas?

Time: 2015-11-12 01:18:09

Tags: python numpy pandas

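I have a tab-delimited dataset that, in terms of data types, looks like this when loaded into Excel, except that its dimensions are 83 x 23275. As you can see, the dataset is of mixed type: row 0 and column 0 contain strings.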

               "A"         "B"         "C"         "D"
"2000-01-01"  0.469112 -0.282863 -1.509059 -1.135632
"2000-01-02"  1.212112 -0.173215  0.119209 -1.044236
"2000-01-03" -0.861849 -2.104569 -0.494929  1.071804
"2000-01-04"  0.721555 -0.706771 -1.039575  0.271860
"2000-01-05" -0.424972  0.567020  0.276232 -1.087401
"2000-01-06" -0.673690  0.113648 -1.478427  0.524988
"2000-01-07"  0.404705  0.577046 -1.715002 -1.039268
"2000-01-08" -0.370647 -1.157892 -1.344312  0.844885

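Using pandas or numpy, how can I load this kind of data and access the numbers associated with their correct labels? I would even be satisfied with two label vectors of string dtype (for rows and columns, of lengths 83 and 23275) plus a float64 matrix (82 x 23274 of floating-point data).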

I loaded the file into numpy and into pandas separately, but had no success accessing any of my data.

import numpy as np
import pandas as pd

#numpy
path = "C:/nature13173-s4.txt"
e18 = np.genfromtxt(path, delimiter = '\t')
print(e18.shape)

#pandas
df=pd.read_csv(path, sep='\t',header=None)

2 answers:

Answer 0: (score: 3)

I pasted your sample into a file. Tabs do not survive that well, so I used the default "whitespace" delimiter.

My first attempt:

e18=np.genfromtxt('stack33662863.txt',names=True)

gave me the error

ValueError: Some errors were detected !
Line #2 (got 5 columns instead of 4)

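That is because the first line has 4 column headers but none for the date column, while every data row has 5 fields. So I have to skip the header line and supply my own field names: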

In [624]: e18=np.genfromtxt('stack33662863.txt',names=['date','A','B','C','D'],skip_header=1)
In [625]: e18
Out[625]: 
array([(nan, 0.469112, -0.282863, -1.509059, -1.135632),
       (nan, 1.212112, -0.173215, 0.119209, -1.044236),
       (nan, -0.861849, -2.104569, -0.494929, 1.071804),
       (nan, 0.721555, -0.706771, -1.039575, 0.27186),
       (nan, -0.424972, 0.56702, 0.276232, -1.087401),
       (nan, -0.67369, 0.113648, -1.478427, 0.524988),
       (nan, 0.404705, 0.577046, -1.715002, -1.039268),
       (nan, -0.370647, -1.157892, -1.344312, 0.844885)], 
      dtype=[('date', '<f8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8'), ('D', '<f8')])

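Almost, except that the 'date' column is nan. Let's use dtype=None to tell it to infer the dtype of each column instead of assuming they are all floats. An alternative is to give a dtype for each column.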

In [626]: e18=np.genfromtxt('stack33662863.txt',names=['date','A','B','C','D'],skip_header=1,dtype=None)
In [627]: e18
Out[627]: 
array([(b'"2000-01-01"', 0.469112, -0.282863, -1.509059, -1.135632),
       (b'"2000-01-02"', 1.212112, -0.173215, 0.119209, -1.044236),
       (b'"2000-01-03"', -0.861849, -2.104569, -0.494929, 1.071804),
       (b'"2000-01-04"', 0.721555, -0.706771, -1.039575, 0.27186),
       (b'"2000-01-05"', -0.424972, 0.56702, 0.276232, -1.087401),
       (b'"2000-01-06"', -0.67369, 0.113648, -1.478427, 0.524988),
       (b'"2000-01-07"', 0.404705, 0.577046, -1.715002, -1.039268),
       (b'"2000-01-08"', -0.370647, -1.157892, -1.344312, 0.844885)], 
      dtype=[('date', 'S12'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8'), ('D', '<f8')])

That looks good. The data is there.
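The per-column dtype alternative mentioned above could look like the following sketch. The inline sample (two rows of the data from the question) and the field width 'U12' are assumptions made for illustration. A unicode field is used here instead of letting genfromtxt pick a bytes field, and the quotes are still part of the stored string.

```python
import io
import numpy as np

# Two rows of the sample from the question, whitespace separated (assumed layout).
sample = '''"A" "B" "C" "D"
"2000-01-01" 0.469112 -0.282863 -1.509059 -1.135632
"2000-01-02" 1.212112 -0.173215 0.119209 -1.044236
'''

# One (name, dtype) pair per column; 'U12' holds the 12 characters of
# the quoted date string, 'f8' is float64.
dt = [('date', 'U12'), ('A', 'f8'), ('B', 'f8'), ('C', 'f8'), ('D', 'f8')]

# skip_header=1 drops the header line, which has one label fewer than the data rows.
e18 = np.genfromtxt(io.StringIO(sample), dtype=dt, skip_header=1)

print(e18.dtype.names)
print(e18['date'])
print(e18['A'])
```

With 23274 numeric columns, the dtype list would have to be built in a loop rather than typed out by hand.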

I can access the attributes and values with:

In [628]: e18.shape
Out[628]: (8,)
In [629]: e18.dtype
Out[629]: dtype([('date', 'S12'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8'), ('D', '<f8')])
In [630]: e18['date']
Out[630]: 
array([b'"2000-01-01"', b'"2000-01-02"', b'"2000-01-03"', b'"2000-01-04"',
       b'"2000-01-05"', b'"2000-01-06"', b'"2000-01-07"', b'"2000-01-08"'], 
      dtype='|S12')
In [631]: e18['A']
Out[631]: 
array([ 0.469112,  1.212112, -0.861849,  0.721555, -0.424972, -0.67369 ,
        0.404705, -0.370647])

Another option is to load the data without names:

In [636]: e18=np.genfromtxt('stack33662863.txt',skip_header=1)
In [637]: e18.shape
Out[637]: (8, 5)
In [638]: e18[:3,:]
Out[638]: 
array([[      nan,  0.469112, -0.282863, -1.509059, -1.135632],
       [      nan,  1.212112, -0.173215,  0.119209, -1.044236],
       [      nan, -0.861849, -2.104569, -0.494929,  1.071804]])

Now it is all floats in one 2d array, but with nan in the first (date) column. We can slice that column off to get a clean 2d array:

In [639]: e18[:,1:]
Out[639]: 
array([[ 0.469112, -0.282863, -1.509059, -1.135632],
       [ 1.212112, -0.173215,  0.119209, -1.044236],
       [-0.861849, -2.104569, -0.494929,  1.071804],
       [ 0.721555, -0.706771, -1.039575,  0.27186 ],
       [-0.424972,  0.56702 ,  0.276232, -1.087401],
       [-0.67369 ,  0.113648, -1.478427,  0.524988],
       [ 0.404705,  0.577046, -1.715002, -1.039268],
       [-0.370647, -1.157892, -1.344312,  0.844885]])

I can get the same array with usecols. With the many more columns in the real data this might not work as well (but feel free to try it):

e18=np.genfromtxt('stack33662863.txt',skip_header=1,usecols=range(1,5))

You can load the dates separately:

In [647]: np.genfromtxt('stack33662863.txt',skip_header=1,usecols=0,dtype=None)
Out[647]: 
array([b'"2000-01-01"', b'"2000-01-02"', b'"2000-01-03"', b'"2000-01-04"',
       b'"2000-01-05"', b'"2000-01-06"', b'"2000-01-07"', b'"2000-01-08"'], 
      dtype='|S12')
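Putting the separate loads together gives the layout the question asks for: two label vectors and a float64 matrix. The following is only a sketch. `load_labeled` is a hypothetical helper, and it assumes the header holds the column labels in its last fields, that no field contains whitespace, and that there are at least two data rows.

```python
import numpy as np

def load_labeled(lines):
    """Split text lines into row labels, column labels and a float matrix."""
    lines = [ln for ln in lines if ln.strip()]        # ignore blank lines
    header, body = lines[0], lines[1:]
    n_cols = len(body[0].split()) - 1                 # numeric columns per row
    # Take the last n_cols header tokens, so a header with or without a
    # label above the first column is handled the same way.
    col_labels = np.array([t.strip('"') for t in header.split()[-n_cols:]])
    row_labels = np.array([ln.split()[0].strip('"') for ln in body])
    # genfromtxt also accepts a list of strings; column 0 is skipped via usecols.
    values = np.genfromtxt(body, usecols=range(1, n_cols + 1))
    return row_labels, col_labels, values

# Tab-separated stand-in for the real file (assumed layout).
sample = ('"A"\t"B"\t"C"\t"D"\n'
          '"2000-01-01"\t0.469112\t-0.282863\t-1.509059\t-1.135632\n'
          '"2000-01-02"\t1.212112\t-0.173215\t0.119209\t-1.044236\n'
          '"2000-01-03"\t-0.861849\t-2.104569\t-0.494929\t1.071804\n')

rows, cols, vals = load_labeled(sample.splitlines())
print(rows, cols, vals.shape, vals.dtype)
```

For the real file, the lines could come from `open(path)` instead of the inline string.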

Yet another option is to define a dtype that holds all the numeric columns in a single field:

In [654]: dt=np.dtype([('date','S12'),('data','float',(4,))])
In [655]: e18=np.genfromtxt('stack33662863.txt',skip_header=1,dtype=dt)
In [656]: e18['date']
Out[656]: 
array([b'"2000-01-01"', b'"2000-01-02"', b'"2000-01-03"', b'"2000-01-04"',
       b'"2000-01-05"', b'"2000-01-06"', b'"2000-01-07"', b'"2000-01-08"'], 
      dtype='|S12')

Now you can retrieve the numeric part as a 2d array:

In [658]: e18['data']
Out[658]: 
array([[ 0.469112, -0.282863, -1.509059, -1.135632],
       [ 1.212112, -0.173215,  0.119209, -1.044236],
       [-0.861849, -2.104569, -0.494929,  1.071804],
       [ 0.721555, -0.706771, -1.039575,  0.27186 ],
       [-0.424972,  0.56702 ,  0.276232, -1.087401],
       [-0.67369 ,  0.113648, -1.478427,  0.524988],
       [ 0.404705,  0.577046, -1.715002, -1.039268],
       [-0.370647, -1.157892, -1.344312,  0.844885]])
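With the labels in one field and the numbers in another, a value can be looked up by its row and column label. This is a sketch only: the array is built by hand rather than read from the file, the date field is stored as unquoted unicode text, and the `cols` list of column labels is an assumption, since this dtype does not store them.

```python
import numpy as np

# Same shape of dtype as above, but with an unquoted unicode date field.
dt = np.dtype([('date', 'U10'), ('data', 'f8', (4,))])
e18 = np.array(
    [('2000-01-01', [0.469112, -0.282863, -1.509059, -1.135632]),
     ('2000-01-02', [1.212112, -0.173215, 0.119209, -1.044236])],
    dtype=dt)

cols = ['A', 'B', 'C', 'D']   # column labels kept separately (assumed)

# Find the row whose label matches, then index the 2d 'data' field.
row = np.flatnonzero(e18['date'] == '2000-01-02')[0]
val = e18['data'][row, cols.index('C')]
print(val)
```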

Answer 1: (score: 3)

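Actually this works fine for me in pandas when I use \s+ as the separator instead of tab; it seems that the tab character is not the separator in your case.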

In [10]:
pd.read_csv('C:/nature13173-s4.txt', sep=r'\s+')
Out[10]:
               A           B            C          D
2000-01-01  0.469112    -0.282863   -1.509059   -1.135632
2000-01-02  1.212112    -0.173215   0.119209    -1.044236
2000-01-03  -0.861849   -2.104569   -0.494929   1.071804
2000-01-04  0.721555    -0.706771   -1.039575   0.271860
2000-01-05  -0.424972   0.567020    0.276232    -1.087401
2000-01-06  -0.673690   0.113648    -1.478427   0.524988
2000-01-07  0.404705    0.577046    -1.715002   -1.039268
2000-01-08  -0.370647   -1.157892   -1.344312   0.844885
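Once the frame is loaded this way, the numbers can be reached through their labels, and the label vectors and float matrix can be pulled out separately. A minimal sketch, using an inline three-row copy of the sample instead of the real file (an assumption for illustration):

```python
import io
import pandas as pd

# Three rows of the sample from the question, whitespace separated.
sample = '''"A" "B" "C" "D"
"2000-01-01" 0.469112 -0.282863 -1.509059 -1.135632
"2000-01-02" 1.212112 -0.173215 0.119209 -1.044236
"2000-01-03" -0.861849 -2.104569 -0.494929 1.071804
'''

# The header has one field fewer than the data rows, so pandas uses the
# first column as the index.
df = pd.read_csv(io.StringIO(sample), sep=r'\s+')

value = df.loc['2000-01-02', 'B']     # lookup by row and column label
row_labels = df.index.to_numpy()      # row label vector
col_labels = df.columns.to_numpy()    # column label vector
matrix = df.to_numpy()                # 2d float64 array

print(value)
print(matrix.shape, matrix.dtype)
```

The index holds plain strings here; passing `parse_dates=True` to `read_csv` is one way to ask pandas to parse it as dates instead.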