Question

I have a DataFrame column df['anno'] that looks like this:

df.anno

0         type I secretion outer membrane protein, TolC...
1         conserved hypothetical protein [Shigella boyd...
2              Transposase [Congregibacter litoralis KT71]
3         Chain A, The Crystal Structure Of Chlorite Di...
4         chlorite dismutase, partial [uncultured bacte...
5         carbamoyl-phosphate synthase, small subunit [...
6         anthranilate synthase component 1 [endosymbio...
7         chlorite dismutase, partial [bacterium enrich...
8         peptidase dimerization domain protein [Myroid...
9         MULTISPECIES: MFS transporter [Enterobacteria...
10        CAAX amino terminal protease family protein [...
11        Fe-S oxidoreductase [Desulfovibrio africanus ...
12        phosphoenolpyruvate synthase/pyruvate phospha...

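Each row has two parts: (1) the protein name and (2) the microbial species, enclosed in '[......]'.

I want to extract the protein name and discard the species, so I decided to first split the column into two columns at the '['.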

df2 = pd.DataFrame(df.anno.str.split("[", 1).tolist(), columns = ['protein','species'])

It returns an error:

TypeError: object of type 'NoneType' has no len()

I also tried:

df[['protein','species']] =  df['anno'].str.split('[', expand=True)

It also returns an error:

ValueError: Columns must be same length as key

How can I do this? Is there another way to extract the protein name? Thanks!

Answer 1

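I think the problem is that some rows contain more than one [, so the split produces more than two columns. Add n=1 to split so that each row is split only at the first [. To remove the trailing ], use rstrip: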

df[['protein','species']] =  df['anno'].str.rstrip(']').str.split('[', expand=True, n=1)
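To see why the original attempt failed, here is a minimal sketch on a small made-up frame (the sample strings are hypothetical, not taken from the asker's data). Without n=1, a row with two [ expands to three columns, which cannot be assigned to two column names:

```python
import pandas as pd

# Hypothetical sample: one row has two '[', one has none, one is missing.
df = pd.DataFrame({'anno': ['protein [q][sd]', 'protein', 'Transposase [KT71]', None]})

# Without n=1 the first row splits into three pieces, so expand=True gives 3 columns.
wide = df['anno'].str.split('[', expand=True)
print(wide.shape[1])  # 3

# With n=1 every row is split at most once, so there are never more than 2 columns.
two = df['anno'].str.rstrip(']').str.split('[', expand=True, n=1)
two.columns = ['protein', 'species']
print(two)
```

Assigning the three-column result to df[['protein','species']] is what raises "Columns must be same length as key".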

To split at the last [ instead, use rsplit:

df[['protein','species']] =  df['anno'].str.rstrip(']').str.rsplit('[', expand=True, n=1)
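The difference between split and rsplit only shows up on rows with more than one [. A small sketch on one hypothetical row:

```python
import pandas as pd

s = pd.Series(['protein [q][sd]'])  # hypothetical row with two bracketed parts

# split with n=1 cuts at the first '[', rsplit with n=1 cuts at the last '['.
first = s.str.rstrip(']').str.split('[', n=1, expand=True)
last = s.str.rstrip(']').str.rsplit('[', n=1, expand=True)

print(first.iloc[0].tolist())  # ['protein ', 'q][sd']
print(last.iloc[0].tolist())   # ['protein [q]', 'sd']
```

Which one fits depends on whether a [ can appear inside the protein name itself; if it can, rsplit is probably the safer choice.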

Another solution uses extract to capture the text before and inside the last []:

df[['protein','species']] = df['anno'].str.extract(r'(.*)\[(.*)\]', expand=True)
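One thing to watch with extract: a row that has no brackets at all does not match the pattern, so both columns come back as NaN, including the protein name. A rough sketch of one way to handle that, on hypothetical data (the fillna fallback is my own addition, not part of the answer above):

```python
import pandas as pd

s = pd.Series(['protein [q][sd]', 'protein', 'Transposase [KT71]'])  # hypothetical rows

# The greedy first group runs up to the last '[' in each row.
out = s.str.extract(r'(.*)\[(.*)\]', expand=True)
out.columns = ['protein', 'species']

# Rows without brackets did not match; fall back to the original text for the protein name.
out['protein'] = out['protein'].fillna(s).str.strip()
print(out)
```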

Sample:

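import pandas as pd
# Sample data reconstructed from the printed output below
df = pd.DataFrame({'anno': ['protein [q][sd]', 'protein', 'Transposase [KT71]', None]})
df[['protein','species']] = df['anno'].str.rstrip(']').str.split('[', expand=True, n=1)
# Replace the remaining '][' separators with commas (regex=True is required in pandas 2.0+)
df['species'] = df['species'].str.replace(r'\]\[', ',', regex=True)
df['protein'] = df['protein'].str.strip()
print(df)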
                 anno      protein species
0     protein [q][sd]      protein    q,sd
1             protein      protein    None
2  Transposase [KT71]  Transposase    KT71
3                None         None    None
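Since the question only asks for the protein name, the species column may not be needed at all. A minimal sketch, assuming the same hypothetical sample data, that keeps just the text before the first [:

```python
import pandas as pd

df = pd.DataFrame({'anno': ['protein [q][sd]', 'protein', 'Transposase [KT71]', None]})

# Split once at the first '[', take the first piece, and trim the trailing space.
# Missing values stay missing.
df['protein'] = df['anno'].str.split('[', n=1).str[0].str.strip()
print(df['protein'].tolist())
```

This avoids expand=True entirely, so the column-count mismatch cannot occur.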
