Why can't I split a column into two columns in pandas?

Asked: 2017-09-17 17:57:33

Tags: python pandas split

I have a DataFrame column, df['anno'], that looks like this:

df.anno

0         type I secretion outer membrane protein, TolC...
1         conserved hypothetical protein [Shigella boyd...
2              Transposase [Congregibacter litoralis KT71]
3         Chain A, The Crystal Structure Of Chlorite Di...
4         chlorite dismutase, partial [uncultured bacte...
5         carbamoyl-phosphate synthase, small subunit [...
6         anthranilate synthase component 1 [endosymbio...
7         chlorite dismutase, partial [bacterium enrich...
8         peptidase dimerization domain protein [Myroid...
9         MULTISPECIES: MFS transporter [Enterobacteria...
10        CAAX amino terminal protease family protein [...
11        Fe-S oxidoreductase [Desulfovibrio africanus ...
12        phosphoenolpyruvate synthase/pyruvate phospha...

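Each row has two parts: (1) the protein name, and (2) the microbial species enclosed in '[......]'.

I want to extract the protein name and discard the species, so I decided to first split the column into two columns at '['.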

df2 = pd.DataFrame(df.anno.str.split("[", 1).tolist(), columns = ['protein','species'])

It returns an error:

TypeError: object of type 'NoneType' has no len()

I also tried:

df[['protein','species']] =  df['anno'].str.split('[', expand=True) 

It also returned an error:

ValueError: Columns must be same length as key

How can I do this? Is there another way to extract the protein name? Thanks!

1 Answer:

Answer 0 (score: 1):

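I think the problem is that some rows contain more than one [, so add n=1 to split so that each row is split only at the first [. To remove the trailing ], use rstrip: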

df[['protein','species']] =  df['anno'].str.rstrip(']').str.split('[', expand=True, n=1) 
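As a rough illustration of why n=1 matters, the sketch below uses two made-up rows (not the asker's real data), one of which contains two [ characters. Without n, that row splits into three pieces, so expand=True produces three columns, which cannot be assigned to two column names.

```python
import pandas as pd

# Hypothetical sample data; the second row contains two '[' characters.
df = pd.DataFrame({'anno': ['Transposase [KT71]', 'protein [q][sd]']})

# Without n, the second row splits into three pieces,
# so expand=True yields three columns.
unlimited = df['anno'].str.split('[', expand=True)

# With n=1, each row is split at most once, so there are at most two columns.
limited = df['anno'].str.rstrip(']').str.split('[', expand=True, n=1)

print(unlimited.shape[1])  # number of columns without n
print(limited.shape[1])    # number of columns with n=1
```

Assigning the three-column result to df[['protein','species']] is the kind of mismatch that a "Columns must be same length as key" error points to.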

To split at the last [ instead, use rsplit:

df[['protein','species']] =  df['anno'].str.rstrip(']').str.rsplit('[', expand=True, n=1) 
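The difference between the two calls only shows up on rows with more than one [. A minimal comparison on a single made-up value (an assumption for illustration) might look like this:

```python
import pandas as pd

# Hypothetical value with two bracketed parts.
s = pd.Series(['protein [q][sd]'])
stripped = s.str.rstrip(']')

# split with n=1 cuts at the first '[' ...
first = stripped.str.split('[', expand=True, n=1)
# ... while rsplit with n=1 cuts at the last '['.
last = stripped.str.rsplit('[', expand=True, n=1)

print(first.iloc[0].tolist())
print(last.iloc[0].tolist())
```

With split, everything after the first [ ends up in the second column; with rsplit, only the last bracketed part does.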

Another solution uses extract, which captures the contents of the last [...] pair:

df[['protein','species']] = df['anno'].str.extract(r'(.*)\[(.*)\]', expand=True)
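To see how the greedy first group behaves, here is a small sketch on made-up values (assumed for illustration). Because (.*) is greedy, it consumes as much as possible, so the second group ends up holding the last bracketed part; rows without brackets do not match and yield missing values in both columns.

```python
import pandas as pd

# Hypothetical values: two brackets, one bracket, no bracket.
s = pd.Series(['protein [q][sd]', 'Transposase [KT71]', 'protein'])

# Raw string avoids invalid-escape warnings for \[ and \].
out = s.str.extract(r'(.*)\[(.*)\]', expand=True)
out.columns = ['protein', 'species']

print(out)
```

Note that the protein column keeps any trailing space before the bracket, so a follow-up str.strip() may still be needed.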

Sample:

df[['protein','species']] = df['anno'].str.rstrip(']').str.split('[', expand=True, n=1)
# '][' is replaced as a literal string, so no regex escaping is needed
df['species'] = df['species'].str.replace('][', ',', regex=False)
df['protein'] = df['protein'].str.strip()
print(df)
                 anno      protein species
0     protein [q][sd]      protein    q,sd
1             protein      protein    None
2  Transposase [KT71]  Transposase    KT71
3                None         None    None
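If only the protein name is needed, as the question describes, the species column can be skipped entirely. The following is a sketch under that assumption, reusing the sample values above; it takes the first piece of each split and strips surrounding whitespace. Missing values stay missing.

```python
import pandas as pd

# Same made-up sample values as above, including a missing entry.
df = pd.DataFrame({'anno': ['protein [q][sd]', 'protein',
                            'Transposase [KT71]', None]})

# Split once at the first '[', keep the part before it, trim whitespace.
df['protein'] = df['anno'].str.split('[', n=1).str[0].str.strip()

print(df)
```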