Possible inconsistency in how pandas read_table() handles text

Time: 2012-08-20 02:04:06

Tags: python pandas

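In a previous post, I found that pandas read_table() can treat variable-length whitespace as the delimiter when called as read_table('datafile', sep=r'\s*'). This works for many of my files, but it fails on others that are highly similar.

Edit: The example I originally posted did not reproduce the problem when others tried it. I have therefore posted links to the original files, AY907538 and AY942707, and left the error message I could not resolve.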

from pandas import read_table

## filename: AY942707
# this will load with no problem
data = read_table('AY942707.hmmdomtblout', header=None, skiprows=3, sep=r'\s*')

## filename: AY907538
data = read_table('AY907538.hmmdomtblout', header=None, skiprows=3, sep=r'\s*')

This produces the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-131d10d1fb1d> in <module>()
      2 
      3 #temp = get_dataset('AY907538.hmmdomtblout')
----> 4 data = read_table('AY907538.hmmdomtblout', header=None, skiprows=3, sep=r'\s*')
      5 #data = read_table('AY942707.hmmdomtblout', header=None, skiprows=3, sep=r'\s*')

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in read_table(filepath_or_buffer, sep, dialect, header, index_col, names, skiprows, na_values, thousands, comment, parse_dates, keep_date_col, dayfirst, date_parser, nrows, iterator, chunksize, skip_footer, converters, verbose, delimiter, encoding, squeeze)
    282     kwds['encoding'] = None
    283 
--> 284     return _read(TextParser, filepath_or_buffer, kwds)
    285 
    286 @Appender(_read_fwf_doc)

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(cls, filepath_or_buffer, kwds)
    189         return parser
    190 
--> 191     return parser.get_chunk()
    192 
    193 @Appender(_read_csv_doc)

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.pyc in get_chunk(self, rows)
    779             msg = ('Expecting %d columns, got %d in row %d' %
    780                    (col_len, zip_len, row_num))
--> 781             raise ValueError(msg)
    782 
    783         data = dict((k, v) for k, v in izip(self.columns, zipped_content))

ValueError: Expecting 26 columns, got 28 in row 6

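1 answer:

Answer 0 (score: 1)

In both files the last field, description of target, contains multiple words. Because whitespace is used as the delimiter, read_table does not treat description of target as a single column; each word in that field ends up in its own column. In AY942707 the first row's description of target contains more words than that of any other row, but this is not the case in AY907538. read_table determines the number of columns from the first row, and every subsequent row must have the same number of columns or fewer. That is why the error reports that 26 columns were expected but row 6 had 28.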
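
One way around this, sketched here with only the standard library, is to split each line a bounded number of times so that the free-text description stays in one field. The function name `split_fixed_fields`, the `n_fixed` parameter, and the sample data are made up for illustration. The number of fixed columns in a real `.hmmdomtblout` file is an assumption you would need to check against the file's header.

```python
import io

def split_fixed_fields(lines, n_fixed, comment="#"):
    """Split each line into n_fixed whitespace-separated fields plus one
    trailing free-text field that keeps its internal spaces.
    (Hypothetical helper, not part of pandas.)"""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(comment):
            continue  # skip blank lines and comment/header lines
        # Split on runs of whitespace, but at most n_fixed times, so the
        # remainder of the line stays together as the last field.
        parts = line.split(None, n_fixed)
        # Pad short rows (for example, a row with no description).
        parts += [""] * (n_fixed + 1 - len(parts))
        rows.append(parts)
    return rows

# Made-up sample: 3 fixed fields followed by a multi-word description.
sample = io.StringIO(
    "# name score len description\n"
    "seqA  1.5  10  short text\n"
    "seqB  2.0  20  a much longer description here\n"
)
rows = split_fixed_fields(sample, n_fixed=3)
for row in rows:
    print(row)
```

Every row now has the same length regardless of how many words the description contains, so the list can be passed to `pandas.DataFrame(rows)` without a column-count mismatch. All values are still strings at that point, so numeric columns would need converting afterwards.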