Question

我有一个.txt文件，其中包含不同长度的行。每行是表示一个轨迹的系列点。由于每个轨迹都有自己的长度，因此行的长度都不同。也就是说，列数因行而异。

AFAIK，Python中的genfromtxt()模块要求列的数量相同。

>>> import numpy as np
>>> 
>>> data=np.genfromtxt('deer_1995.txt', skip_header=2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\numpy\lib\npyio.py", line 1638, in genfromtxt
    raise ValueError(errmsg)
ValueError: Some errors were detected !
    Line #4 (got 2352 columns instead of 1824)
    Line #5 (got 2182 columns instead of 1824)
    Line #6 (got 1412 columns instead of 1824)
    Line #7 (got 1650 columns instead of 1824)
    Line #8 (got 1688 columns instead of 1824)
    Line #9 (got 1500 columns instead of 1824)
    Line #10 (got 1208 columns instead of 1824)

It is also able to fill the missing values by the help of filling_values。但是，我认为这会带来不必要的麻烦，我希望避免这种麻烦。

那么在不填写“缺失值”的情况下，简单地导入此数据集的最佳（Pythonic）方法是什么？

Answer 1

Numpy.genfromtxt不处理可变长度的行，因为numpy只适用于数组和矩阵（固定的行/列大小）。

您需要手动解析数据。例如：

数据（基于csv）：

0.613 ;  5.919 
0.615 ;  5.349
0.615 ;  5.413
0.617 ;  6.674
0.617 ;  6.616
0.63 ;   7.418
0.642 ;  7.809 ; 5.919
0.648 ;  8.04
0.673 ;  8.789
0.695 ;  9.45
0.712 ;  9.825
0.734 ;  10.265
0.748 ;  10.516
0.764 ;  10.782
0.775 ;  10.979
0.783 ;  11.1
0.808 ;  11.479
0.849 ;  11.951
0.899 ;  12.295
0.951 ;  12.537
0.972 ;  12.675
1.038 ;  12.937
1.098 ;  13.173
1.162 ;  13.464
1.228 ;  13.789
1.294 ;  14.126
1.363 ;  14.518
1.441 ;  14.969
1.545 ;  15.538
1.64 ;   16.071
1.765 ;  16.7
1.904 ;  17.484
2.027 ;  18.36
2.123 ;  19.235
2.149 ;  19.655
2.172 ;  20.096
2.198 ;  20.528
2.221 ;  20.945
2.265 ;  21.352
2.312 ;  21.76
2.365 ;  22.228
2.401 ;  22.836
2.477 ;  23.804

解析器：

import csv
datafile = open('i.csv', 'r')
datareader = csv.reader(datafile)
data = []
for row in datareader:
    # I split the input string based on the comma separator, and cast every elements into a float
    data.append( [ float(elem) for elem in row[0].split(";") ] )

print data

Python中genfromtxt（）的可变列数？

1 个答案: