Reading floating-point values from an ASCII file with a variable number of columns

Asked: 2014-06-14 19:48:20

Tags: python numpy cython

I have ASCII files containing floating-point numbers. Most lines have 10 columns, but some lines have fewer. An example:

* lat =   33.2813
  19.61  19.92  21.82  21.94  22.77  25.81  29.48  29.86  29.92  28.98
  27.94  25.78  23.68  23.37
* lat =   33.3438
  20.16  23.62  27.73  31.12  33.06  34.01  35.78  37.03  37.79  35.74
  34.12  31.83  33.98  28.57
* lat =   33.4063
  28.26  30.04  35.00  37.92  41.50  44.55  45.44  46.74  46.74  43.47
  37.67  35.67  35.67  31.64
* lat =   33.4688
  34.02  36.07  38.95  44.24  46.49  47.98  50.62  51.95  51.95  51.95
  48.31  41.03  38.01  34.58
* lat =   33.5313
  36.94  37.12  44.04  48.41  51.70  52.71  54.18  55.71  56.98  62.10
  57.26  49.05  44.18  41.50

Lines starting with * are comments.

How can I read this file efficiently with numpy? (This is a toy example; my actual data files contain >> 1E6 values.) The numpy functions loadtxt/genfromtxt don't seem to be able to cope with a variable number of columns:

In [25]: np.loadtxt(fn, comments="*", dtype=float)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-25-2419eebb6114> in <module>()
----> 1 np.loadtxt(fn, comments="*", dtype=float)

/usr/lib/pymodules/python2.7/numpy/lib/npyio.pyc in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin)
    833             fh.close()
    834 
--> 835     X = np.array(X, dtype)
    836     # Multicolumn data are returned with shape (1, N, M), i.e.
    837     # (1, 1, M) for a single row - remove the singleton dimension there

ValueError: setting an array element with a sequence.

genfromtxt is more verbose, but doesn't work either:

In [27]: np.genfromtxt(fn, comments="*", dtype=float)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-27-6c6e8879e4b9> in <module>()
----> 1 np.genfromtxt(fn, comments="*", dtype=float)

/usr/lib/pymodules/python2.7/numpy/lib/npyio.pyc in genfromtxt(fname, dtype, comments, delimiter, skiprows, skip_header, skip_footer, converters, missing, missing_values, filling_values, usecols, names, excludelist, deletechars, replace_space, autostrip, case_sensitive, defaultfmt, unpack, usemask, loose, invalid_raise)
   1636             # Raise an exception ?
   1637             if invalid_raise:
-> 1638                 raise ValueError(errmsg)
   1639             # Issue a warning ?
   1640             else:

ValueError: Some errors were detected !
    Line #2 (got 4 columns instead of 10)
    Line #5 (got 4 columns instead of 10)
    Line #8 (got 4 columns instead of 10)
    Line #11 (got 4 columns instead of 10)
    Line #14 (got 4 columns instead of 10)
    Line #17 (got 4 columns instead of 10)
    Line #20 (got 4 columns instead of 10)
    Line #23 (got 4 columns instead of 10)
    Line #26 (got 4 columns instead of 10)
    Line #29 (got 4 columns instead of 10)

There does appear to be a kwarg invalid_raise, but setting it to False causes lines with fewer than 10 values to be silently dropped, as the sketch below illustrates.
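A minimal sketch of that behavior, assuming the toy file above and the fn variable from the sessions above (genfromtxt emits a ConversionWarning for the short lines and discards them):

    import numpy as np

    # invalid_raise=False makes genfromtxt warn about the 4-column rows
    # and skip them instead of raising, so those values are simply lost.
    data = np.genfromtxt(fn, comments="*", invalid_raise=False)
    # For the toy example: data.shape == (5, 10) -- only the full rows survive.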

I would appreciate help with this problem. I'd be happy to write my own file parser in Cython, but I can't find any information on string -> float conversion in Cython...

1 Answer:

Answer 0 (score: 3)

Here's a way to do it using the pandas parser. If you just want the numpy array, take df.values.

In [239]: import pandas as pd

In [240]: df = pd.read_csv('input.txt', header=None, skiprows=1, delim_whitespace=True)

In [242]: df = df[df[0] != '*']  #filter out comment rows

In [245]: df = df.convert_objects(convert_numeric=True)

In [246]: df
Out[246]: 
        0      1      2      3      4      5      6      7      8      9
0   19.61  19.92  21.82  21.94  22.77  25.81  29.48  29.86  29.92  28.98
1   27.94  25.78  23.68  23.37    NaN    NaN    NaN    NaN    NaN    NaN
3   20.16  23.62  27.73  31.12  33.06  34.01  35.78  37.03  37.79  35.74
4   34.12  31.83  33.98  28.57    NaN    NaN    NaN    NaN    NaN    NaN
6   28.26  30.04  35.00  37.92  41.50  44.55  45.44  46.74  46.74  43.47
7   37.67  35.67  35.67  31.64    NaN    NaN    NaN    NaN    NaN    NaN
9   34.02  36.07  38.95  44.24  46.49  47.98  50.62  51.95  51.95  51.95
10  48.31  41.03  38.01  34.58    NaN    NaN    NaN    NaN    NaN    NaN
12  36.94  37.12  44.04  48.41  51.70  52.71  54.18  55.71  56.98  62.10
13  57.26  49.05  44.18  41.50    NaN    NaN    NaN    NaN    NaN    NaN
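
Note that convert_objects was deprecated and later removed from pandas, and delim_whitespace has also been deprecated in recent releases. A sketch of the same approach for current pandas versions (pd.to_numeric and sep=r'\s+' are the modern replacements; this is an update, not part of the original answer):

    import pandas as pd

    # Same idea on modern pandas: sep=r'\s+' replaces delim_whitespace=True,
    # and pd.to_numeric replaces the removed convert_objects.
    df = pd.read_csv('input.txt', header=None, skiprows=1, sep=r'\s+')
    df = df[df[0] != '*']                          # filter out comment rows
    df = df.apply(pd.to_numeric, errors='coerce')  # strings -> floats
    arr = df.values                                # NaN-padded numpy array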