How to read a tab-separated CSV in blaze?

Asked: 2015-09-22 11:49:37

Tags: pandas blaze dask

I have a "CSV" data file in the following format (well, it is rather a TSV):

event  pdg x   y   z   t   px  py  pz  ekin
3383    11  -161.515    5.01938e-05 -0.000187112    0.195413    0.664065    0.126078    -0.736968   0.00723234  
1694    11  -161.515    -0.000355633    0.000263174 0.195413    0.511853    -0.523429   0.681196    0.00472714  
4228    11  -161.535    6.59631e-06 -3.32796e-05    0.194947    -0.713983   -0.0265468  -0.69966    0.0108681   
4233    11  -161.515    -0.000524488    6.5069e-05  0.195413    0.942642    0.331324    0.0406377   0.017594
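Since the delimiter is known up front, even the standard library's csv module can read this format directly when told `delimiter="\t"` (a minimal sketch; the inlined two-row sample stands in for the real file):

```python
import csv
import io

# Stand-in for the file above: tab-separated fields, header row first
sample = (
    "event\tpdg\tx\ty\tz\tt\tpx\tpy\tpz\tekin\n"
    "3383\t11\t-161.515\t5.01938e-05\t-0.000187112\t0.195413"
    "\t0.664065\t0.126078\t-0.736968\t0.00723234\n"
)

# Passing the delimiter explicitly bypasses any sniffing step
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
header, first = rows[0], rows[1]
print(header[0])        # event
print(float(first[-1])) # 0.00723234
```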

This file can be parsed as-is with pandas:
from pandas import read_csv, read_table
data = read_csv("test.csv", sep="\t", index_col=False)     # Works
data = read_table("test.csv", index_col=False)             # Works

However, when I try to read it in blaze (which claims to accept pandas keyword arguments), an exception is thrown:

from blaze import Data 
Data("test.csv")                             # Attempt 1
Data("test.csv", sep="\t")                   # Attempt 2
Data("test.csv", sep="\t", index_col=False)  # Attempt 3

None of these attempts works, and the pandas keyword arguments are not used at all. The "sniffer" that tries to infer column names and types just calls csv.Sniffer.sniff() from the standard library (which fails).

Is there a way to read this file properly in blaze? (Given that its "bigger sibling" is a few hundred MB, I would like to use blaze's sequential processing capabilities.)

Thanks for any ideas.

EDIT: I think this may be an issue with odo/csv and have filed an issue: https://github.com/blaze/odo/issues/327

EDIT2: Full error:

Error                                     Traceback (most recent call last)
 in ()
----> 1 bz.Data("test.csv", sep="\t", index_col=False)

/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/blaze/interactive.py in Data(data, dshape, name, fields, columns, schema, **kwargs)
     54     if isinstance(data, _strtypes):
     55         data = resource(data, schema=schema, dshape=dshape, columns=columns,
---> 56                         **kwargs)
     57     if (isinstance(data, Iterator) and
     58             not isinstance(data, tuple(not_an_iterator))):

/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/regex.py in __call__(self, s, *args, **kwargs)
     62 
     63     def __call__(self, s, *args, **kwargs):
---> 64         return self.dispatch(s)(s, *args, **kwargs)
     65 
     66     @property

/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in resource_csv(uri, **kwargs)
    276 @resource.register('.+\.(csv|tsv|ssv|data|dat)(\.gz|\.bz2?)?')
    277 def resource_csv(uri, **kwargs):
--> 278     return CSV(uri, **kwargs)
    279 
    280 

/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in __init__(self, path, has_header, encoding, sniff_nbytes, **kwargs)
    102         if has_header is None:
    103             self.has_header = (not os.path.exists(path) or
--> 104                                infer_header(path, sniff_nbytes))
    105         else:
    106             self.has_header = has_header

/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in infer_header(path, nbytes, encoding, **kwargs)
     58     with open_file(path, 'rb') as f:
     59         raw = f.read(nbytes)
---> 60     return csv.Sniffer().has_header(raw if PY2 else raw.decode(encoding))
     61 
     62 

/home/[username-hidden]/anaconda3/lib/python3.4/csv.py in has_header(self, sample)
    392         # subtracting from the likelihood of the first row being a header.
    393 
--> 394         rdr = reader(StringIO(sample), self.sniff(sample))
    395 
    396         header = next(rdr) # assume first row is header

/home/[username-hidden]/anaconda3/lib/python3.4/csv.py in sniff(self, sample, delimiters)
    187 
    188         if not delimiter:
--> 189             raise Error("Could not determine delimiter")
    190 
    191         class dialect(Dialect):

Error: Could not determine delimiter
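The traceback bottoms out in csv.Sniffer from the standard library. In isolation, the sniffer can be made deterministic by restricting its candidate delimiters via the `delimiters` argument of `sniff()`, which is precisely the knob odo does not expose here (a stdlib-only sketch; the inlined sample is an assumption):

```python
import csv

# A small tab-separated sample in the same shape as the data file
sample = (
    "event\tpdg\tx\n"
    "3383\t11\t-161.515\n"
    "1694\t11\t-161.515\n"
)

# With no hint, Sniffer guesses among ',', '\t', ';', ' ', ':' and can fail
# on whitespace-heavy data; restricting the candidates avoids the guesswork.
dialect = csv.Sniffer().sniff(sample, delimiters="\t")
print(dialect.delimiter == "\t")  # True
```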

1 Answer:

Answer 0 (score: 3)

I am using Python 2.7.10, dask v0.7.1, blaze v0.8.2, and conda v3.17.0.

conda install dask
conda install blaze

Here is one way to import the data for use with blaze: parse it first with pandas and then convert it to blaze. Perhaps this defeats the purpose, but it works without any trouble.

As a side note, in order to parse the data file correctly, the read line in your pandas parsing statement should use a whitespace regex as the separator:

from blaze import Data
from pandas import read_csv

# r"\s+" splits on any run of whitespace; the non-raw "\s*" variant
# misbehaves on newer Python versions, where zero-width splits changed
data = read_csv("csvdata.dat", sep=r"\s+", index_col=False)
bdata = Data(data)

Now the data is in the correct format, and bdata works without errors:

   event  pdg        x         y         z         t        px        py  \
0   3383   11 -161.515  0.000050 -0.000187  0.195413  0.664065  0.126078   
1   1694   11 -161.515 -0.000356  0.000263  0.195413  0.511853 -0.523429   
2   4228   11 -161.535  0.000007 -0.000033  0.194947 -0.713983 -0.026547   
3   4233   11 -161.515 -0.000524  0.000065  0.195413  0.942642  0.331324   

     pz      ekin  
0 -0.736968  0.007232  
1  0.681196  0.004727  
2 -0.699660  0.010868  
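If the goal is sequential processing of the few-hundred-MB file, plain pandas can also stream it with `chunksize` instead of loading everything at once (a sketch; the in-memory sample and the chunk size are assumptions, standing in for the real file path):

```python
import io
import pandas as pd

# Stand-in for the real few-hundred-MB tab-separated file
tsv = (
    "event\tpdg\tekin\n"
    "3383\t11\t0.00723234\n"
    "1694\t11\t0.00472714\n"
    "4228\t11\t0.0108681\n"
)

# With chunksize, read_csv returns an iterator of DataFrames,
# so each chunk can be processed and discarded in turn
total_rows = 0
ekin_sum = 0.0
for chunk in pd.read_csv(io.StringIO(tsv), sep="\t", chunksize=2):
    total_rows += len(chunk)
    ekin_sum += chunk["ekin"].sum()
print(total_rows)  # 3
```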

Here is an alternative using dask, which can probably do the same chunking or out-of-core processing you are looking for. Dask certainly loads the tsv format correctly, and easily.

In [17]: import dask.dataframe as dd

In [18]: df = dd.read_csv('tsvdata.txt', sep='\t', index_col=False)

In [19]: df.head()
Out[19]: 
   event  pdg        x         y         z         t        px        py  \
0   3383   11 -161.515  0.000050 -0.000187  0.195413  0.664065  0.126078   
1   1694   11 -161.515 -0.000356  0.000263  0.195413  0.511853 -0.523429   
2   4228   11 -161.535  0.000007 -0.000033  0.194947 -0.713983 -0.026547   
3   4233   11 -161.515 -0.000524  0.000065  0.195413  0.942642  0.331324   
4    854   11 -161.515  0.000032  0.000418  0.195414  0.675752  0.315671   

         pz      ekin  
0 -0.736968  0.007232  
1  0.681196  0.004727  
2 -0.699660  0.010868  
3  0.040638  0.017594  
4 -0.666116  0.012641  


See also: http://dask.pydata.org/en/latest/array-blaze.html#how-to-use-blaze-with-dask