I have a DataFrame column, df['anno'], that looks like this:
df.anno
0 type I secretion outer membrane protein, TolC...
1 conserved hypothetical protein [Shigella boyd...
2 Transposase [Congregibacter litoralis KT71]
3 Chain A, The Crystal Structure Of Chlorite Di...
4 chlorite dismutase, partial [uncultured bacte...
5 carbamoyl-phosphate synthase, small subunit [...
6 anthranilate synthase component 1 [endosymbio...
7 chlorite dismutase, partial [bacterium enrich...
8 peptidase dimerization domain protein [Myroid...
9 MULTISPECIES: MFS transporter [Enterobacteria...
10 CAAX amino terminal protease family protein [...
11 Fe-S oxidoreductase [Desulfovibrio africanus ...
12 phosphoenolpyruvate synthase/pyruvate phospha...
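Each row has two parts: (1) the protein name, and (2) the microbial species, enclosed in '[...]'.
I want to extract the protein name and discard the species, so I decided to first split the column into two at '['.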
df2 = pd.DataFrame(df.anno.str.split("[", 1).tolist(), columns = ['protein','species'])
It returns an error:
TypeError: object of type 'NoneType' has no len()
I also tried:
df[['protein','species']] = df['anno'].str.split('[', expand=True)
That also returned an error:
ValueError: Columns must be same length as key
How can I do this? Is there another way to extract the protein name? Thanks!
Answer 0 (score: 1)
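I think the problem is that some rows contain more than one [, so add n=1 to split so that it splits only at the first [. To remove the trailing ], use rstrip: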
df[['protein','species']] = df['anno'].str.rstrip(']').str.split('[', expand=True, n=1)
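As a rough illustration of why n=1 matters, the sketch below compares the column counts with and without it. The four sample rows are made up for the demonstration and are not the asker's real data; the names unlimited and limited are likewise just illustrative.

```python
import pandas as pd

# Made-up sample rows (an assumption, not the asker's data):
# one row with two bracketed parts, one without brackets, one missing value.
df = pd.DataFrame({"anno": ["protein [q][sd]", "protein", "Transposase [KT71]", None]})

# Without n, every '[' is a split point, so the row with the most '['
# decides how many columns come back.
unlimited = df["anno"].str.split("[", expand=True)
print(unlimited.shape)

# With n=1 only the first '[' is used, so there are at most two columns,
# which matches the two names on the left-hand side of the assignment.
limited = df["anno"].str.rstrip("]").str.split("[", n=1, expand=True)
print(limited.shape)
```

With more than two columns on the right and only two names on the left, the assignment cannot line up, which is consistent with the "Columns must be same length as key" message.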
To split at the last [ instead, use rsplit:
df[['protein','species']] = df['anno'].str.rstrip(']').str.rsplit('[', expand=True, n=1)
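The difference between the two variants only shows up on rows with more than one [. A small sketch on two made-up strings (assumed for illustration only) may help when choosing between them:

```python
import pandas as pd

# Made-up sample strings (an assumption, not the asker's data).
s = pd.Series(["protein [q][sd]", "Transposase [KT71]"])
trimmed = s.str.rstrip("]")  # drop the trailing ']' characters first

# split works from the left: everything after the first '[' stays together.
first = trimmed.str.split("[", n=1, expand=True)

# rsplit works from the right: only the last bracketed part is separated.
last = trimmed.str.rsplit("[", n=1, expand=True)

print(first)
print(last)
```

For rows with a single [ the two calls give the same result, so the choice only matters if an annotation itself can contain brackets.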
Another solution uses extract, which captures the contents of the last [...]:
df[['protein','species']] = df['anno'].str.extract(r'(.*)\[(.*)\]', expand=True)
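One thing worth checking with the extract approach is what happens to rows that have no brackets at all. The sketch below uses made-up strings (an assumption for illustration) and the same pattern written as a raw string:

```python
import pandas as pd

# Made-up sample strings (an assumption, not the asker's data).
s = pd.Series(["protein [q][sd]", "protein", "Transposase [KT71]"])

# The first group is greedy, so it runs up to the last '[' in the string;
# the second group captures what sits inside the final pair of brackets.
parts = s.str.extract(r"(.*)\[(.*)\]", expand=True)
parts.columns = ["protein", "species"]

print(parts)
```

A row without any [...] does not match the pattern, so both columns come back as missing for that row, including the protein name. The captured protein name also keeps its trailing space, so a str.strip() afterwards is probably still wanted.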
Sample:
df[['protein','species']] = df['anno'].str.rstrip(']').str.split('[', expand=True, n=1)
df['species'] = df['species'].str.replace('][', ',', regex=False)
df['protein'] = df['protein'].str.strip()
print (df)
                 anno      protein species
0     protein [q][sd]      protein    q,sd
1             protein      protein    None
2  Transposase [KT71]  Transposase    KT71
3                None         None    None
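If only the protein name is needed and the species can be thrown away, a second column may not be necessary at all. The following is a minimal sketch under that assumption, again on made-up rows; it is an alternative to the approaches above, not part of the original answer.

```python
import pandas as pd

# Made-up sample rows (an assumption, not the asker's data).
df = pd.DataFrame({"anno": ["protein [q][sd]", "protein", "Transposase [KT71]", None]})

# Split once at the first '[', keep only the piece before it,
# then strip the whitespace left in front of the bracket.
df["protein"] = df["anno"].str.split("[", n=1).str[0].str.strip()

print(df)
```

Because no expand=True is involved, there is no column-count mismatch to worry about, and rows without brackets simply keep their full text as the protein name. Missing values stay missing.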