usecols in pandas read_table causes "list index out of range"

Time: 2017-08-29 15:46:53

Tags: python pandas

I want to select only 2 columns when parsing some data with pandas.

The help for pd.read_table mentions a usecols option that seems to be exactly what I want:

usecols : array-like, default None
    Return a subset of the columns. All elements in this array must either
    be positional (i.e. integer indices into the document columns) or strings
    that correspond to column names provided either by the user in `names` or
    inferred from the document header row(s). For example, a valid `usecols`
    parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter
    results in much faster parsing time and lower memory usage.

Once read, my data shows columns numbered 0 to 6:

In [338]: pd.read_table("../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/intersect_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt", index_
     ...: col=3, header=None)[:3]
Out[338]: 
                0      1      2  4  5    6
3                                         
WBGene00022277  I   4118  10230  -  .   83
WBGene00022276  I  10412  16842  +  .  230
WBGene00022278  I  17482  26781  -  .  303

But when I try to keep only the index (column 3) and the last column (column 6), I get the following error:

In [339]: pd.read_table("../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/intersect_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt", index_
     ...: col=3, header=None, usecols=(3, 6))[:3]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-339-279bef505f16> in <module>()
----> 1 pd.read_table("../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/intersect_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt", index_col=3, header=None, usecols=(3, 6))[:3]

/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    644                     delim_whitespace=delim_whitespace,
    645                     as_recarray=as_recarray,
--> 646                     warn_bad_lines=warn_bad_lines,
    647                     error_bad_lines=error_bad_lines,
    648                     low_memory=low_memory,

/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    387         kwds['encoding'] = encoding
    388 
--> 389     compression = kwds.get('compression')
    390     compression = _infer_compression(filepath_or_buffer, compression)
    391     filepath_or_buffer, _, compression = get_filepath_or_buffer(

/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    728 
    729                 if dialect_val != provided:
--> 730                     conflict_msgs.append((
    731                         "Conflicting values for '{param}': '{val}' was "
    732                         "provided, but the dialect specifies '{diaval}'. "

/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    921         for arg in _deprecated_args:
    922             parser_default = _c_parser_defaults[arg]
--> 923             msg = ("The '{arg}' argument has been deprecated "
    924                    "and will be removed in a future version."
    925                    .format(arg=arg))

/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1445                 cast_type = dtypes
   1446 
-> 1447             if self.na_filter:
   1448                 col_na_values, col_na_fvalues = _get_na_values(
   1449                     c, na_values, na_fvalues)

/home/bli/.local/lib/python3.6/site-packages/pandas/io/parsers.py in _clean_index_names(columns, index_col)
   2812                 msg = ('Expected %d fields in line %d, saw %d' %
   2813                        (col_len, row_num + 1, actual_len))
-> 2814                 if len(self.delimiter) > 1 and self.quoting != csv.QUOTE_NONE:
   2815                     # see gh-13374
   2816                     reason = ('Error could possibly be due to quotes being '

IndexError: list index out of range

I have successfully used the usecols option in another case, but there I kept some of the headers from the original file.

What is causing the problem?

Edit: header=None is apparently not the problem

I can parse a differently formatted file, also without keeping a header, and there the usecols option works:

In [361]: pd.read_table("../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/feature_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt", skiprows
     ...: =2, index_col=0, header=None, usecols=[0, 6])[:3]
Out[361]: 
                  6
0                  
WBGene00022277   72
WBGene00022276  222
WBGene00022278  302

1 answer:

Answer 0 (score: 1)

It looks like this is related to index_col.

Try setting the index after reading the file:

path = "../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/intersect_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt"
df = pd.read_table(path, header=None, usecols=(3, 6)).set_index(3)[:3]

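Apparently index_col is applied after the columns have been reduced. You are selecting two columns and then asking for position 3 as the index, which no longer exists in the reduced set. Referring to the index column by its position among the selected columns works: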

path = "../RNA_Seq_analyses/mapping_worm_number_tests/hisat2/mapped_C_elegans/intersect_count/W100_1_on_C_elegans/protein_coding_fwd_counts.txt"
df = pd.read_table(path, header=None, usecols=(3, 6), index_col=0)[:3]
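
For anyone who wants to try this without the original file, here is a minimal sketch of the first approach (read the subset, then set the index). The three sample rows are made up to mimic the 7-column, tab-separated layout shown in the question; SAMPLE and counts are illustrative names only. Calling set_index after reading avoids depending on how index_col interacts with usecols, which may differ between pandas versions.

```python
import io

import pandas as pd

# Hypothetical sample rows in the same 7-column, tab-separated layout as the question.
SAMPLE = (
    "I\t4118\t10230\tWBGene00022277\t-\t.\t83\n"
    "I\t10412\t16842\tWBGene00022276\t+\t.\t230\n"
    "I\t17482\t26781\tWBGene00022278\t-\t.\t303\n"
)

# Read only columns 3 and 6. With header=None the kept columns retain their
# original integer labels (3 and 6), so column 3 can be promoted to the index
# after the read instead of through index_col.
counts = pd.read_table(io.StringIO(SAMPLE), header=None, usecols=[3, 6]).set_index(3)

print(counts)
```

The result has the gene IDs as the index and a single remaining column labelled 6.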