Background:
I have a DataFrame with a column that looks like this:
>>> merge_df['AAChange']
0    STK11:NM_000455:exon1:c.148_149TG
Name: AAChange, dtype: object
I need to split it on the ':' character, like this:
>>> new_cols = merge_df['AAChange'].str.split(':').apply(pd.Series,1)
>>> new_cols
       0          1      2            3
0  STK11  NM_000455  exon1  c.148_149TG
Then I need to rename the columns, so I store the new names in a list:
>>> new_colnames = ['Gene.AA', 'Transcript', 'Exon', 'Coding', 'Amino Acid Change']
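However, there is a problem: all 5 of these columns must be present in the output, but in this data entry one field is missing from the source data, leaving only 4 fields. As a result, trying to rename the columns fails: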
>>> new_cols.columns = new_colnames
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/local/apps/python/2.7.3/lib/python2.7/site-packages/pandas/core/generic.py", line 2371, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/src/properties.pyx", line 65, in pandas.lib.AxisProperty.__set__ (pandas/lib.c:45002)
  File "/local/apps/python/2.7.3/lib/python2.7/site-packages/pandas/core/generic.py", line 425, in _set_axis
    self._data.set_axis(axis, labels)
  File "/local/apps/python/2.7.3/lib/python2.7/site-packages/pandas/core/internals.py", line 2572, in set_axis
    'new values have %d elements' % (old_len, new_len))
ValueError: Length mismatch: Expected axis has 4 elements, new values have 5 elements
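So I want to add an empty column for each missing column and change the column names at the same time. This answer seems to offer a good solution: reindex against the list of new columns. However, it does not give the expected result: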
>>> new_cols.reindex(columns = new_colnames)
   Gene.AA  Transcript  Exon  Coding  Amino Acid Change
0      NaN         NaN   NaN     NaN                NaN
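Now I have all the missing columns, but the original data is lost. Is there a better solution that lets me rename the existing columns and add all the missing ones?
The desired output looks like this: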
>>> new_cols.reindex(columns = new_colnames)
   Gene.AA  Transcript   Exon       Coding  Amino Acid Change
0    STK11   NM_000455  exon1  c.148_149TG                NaN
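Answer 0 (score: 0)
You can rename the original columns using the leading names from the desired list, that is, every name except the last.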
new_cols.columns = new_colnames[:-1]
# new_cols
   Gene.AA  Transcript   Exon       Coding
0    STK11   NM_000455  exon1  c.148_149TG
Then insert the extra column with the following command. It inserts the new column at position 4 and fills it with NaN values.
import numpy as np  # needed for np.nan
new_cols.insert(4, new_colnames[-1], [np.nan]*len(new_cols.index))
# new_cols
   Gene.AA  Transcript   Exon       Coding  Amino Acid Change
0    STK11   NM_000455  exon1  c.148_149TG                NaN
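The steps above assume exactly one field is missing. If the number of missing fields can vary, one possible generalization is to rename first and reindex afterwards. The reindex in the question returned only NaN because reindex matches on column labels, and the split columns were still labelled 0 to 3 at that point. The sketch below is not from the original thread: the one-row merge_df is rebuilt by hand to mirror the question, the names mapping and result are made up for illustration, and str.split(..., expand=True) assumes a reasonably recent pandas.

```python
import pandas as pd

# Hypothetical one-row frame mirroring the data shown in the question.
merge_df = pd.DataFrame({'AAChange': ['STK11:NM_000455:exon1:c.148_149TG']})
new_colnames = ['Gene.AA', 'Transcript', 'Exon', 'Coding', 'Amino Acid Change']

# expand=True returns the split pieces as columns labelled 0, 1, 2, ...
new_cols = merge_df['AAChange'].str.split(':', expand=True)

# Pair each existing positional label with the name at the same position.
# zip stops at the shorter sequence, so only the columns that exist are renamed.
mapping = dict(zip(new_cols.columns, new_colnames))

# After renaming, reindex keeps the matching columns and adds any
# names that are still missing as all-NaN columns.
result = new_cols.rename(columns=mapping).reindex(columns=new_colnames)
print(result)
```

This assumes that missing fields are always the trailing ones, as in the question. If a field could be missing from the middle of the string, the positional pairing would assign the wrong names.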