Pattern substitution in Python

Time: 2017-06-05 09:10:03

Tags: python regex python-2.7

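I am looking for alternative ways to clean up a tabular file that contains information in parentheses. This will be the first step in a pipeline, and I need to remove every value inside parentheses, along with the parentheses themselves.

What I have: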

> Otu00467  Bacteria(100);Gracilibacteria(99);unclassified(99);unclassified(99);unclassified(99);unclassified(99);
> Otu00469  Bacteria(100);Proteobacteria(96);unclassified(96);unclassified(96);unclassified(96);unclassified(96);
> Otu00470  Bacteria(100);Proteobacteria(100);Alphaproteobacteria(100);Rhodospirillales(100);Rhodospirillaceae(100);Azospirillum(54);

What I want:

 Otu00467   Bacteria;Gracilibacteria;unclassified;unclassified;unclassified;unclassified;
 Otu00469   Bacteria;Proteobacteria;unclassified;unclassified;unclassified;unclassified;
 Otu00470   Bacteria;Proteobacteria;Alphaproteobacteria;Rhodospirillales;Rhodospirillaceae;Azospirillum;

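My first approach was to split the second column on ";", "(" and ")" and then join everything back together. It works, but it is too ugly.

Thanks.

5 answers:

Answer 0: (score: 1)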

import re
new_string = re.sub(r'\(.*?\)', '', your_string)
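
As a quick check, the sketch below applies the same substitution to one shortened line from the question. The helper name `strip_parens` and the variable `line` are illustrative only and are not part of the original answer; the tab between the two columns is an assumption about the file layout.

```python
import re

def strip_parens(text):
    """Remove every parenthesized group, including the parentheses."""
    # The non-greedy .*? stops at the first ')' so each group is removed on its own.
    return re.sub(r'\(.*?\)', '', text)

line = 'Otu00469\tBacteria(100);Proteobacteria(96);unclassified(96);'
print(strip_parens(line))
```

The non-greedy quantifier matters here: a greedy `\(.*\)` would match from the first "(" to the last ")" on the line and delete the names in between as well.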

Answer 1: (score: 1)

You can do this easily with a regular expression.

import re
text = open('file.txt').read()
text = re.sub(r'\(.*?\)', '', text, flags=re.M)

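re.M is the multiline flag. It only changes where ^ and $ match, and this pattern uses neither, so the flag has no effect here and can be dropped. Because "." does not match a newline by default, a match never spans more than one line.

This code removes every occurrence of (..).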
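
To see what the flag does and does not change for this pattern, the following sketch runs the substitution with and without `re.M` on a two-line string. The sample data is made up for illustration.

```python
import re

sample = 'Otu1\tBacteria(100);\nOtu2\tArchaea(87);\n'
pattern = r'\(.*?\)'

without_flag = re.sub(pattern, '', sample)
with_flag = re.sub(pattern, '', sample, flags=re.M)

# re.M only changes where ^ and $ match; this pattern has neither anchor,
# so both calls produce the same string.
print(without_flag == with_flag)
print(without_flag)
```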

Answer 2: (score: 1)

I would try using a regexp. Something like this:

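import re
pattern = re.compile(r'(\w+)\(\d+\);')
# string holds the taxonomy column of one line
';'.join(pattern.findall(string))

Apply this to the taxonomy column of each line. Note that ';'.join puts separators only between the names, so the result has no trailing ';'.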
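
A minimal sketch of how this could be applied to a whole line while keeping the ID column and the trailing ";" from the desired output. It assumes each line has exactly two whitespace-separated columns and writes a tab between them; the names `NAME_WITH_SCORE` and `clean_line` are hypothetical.

```python
import re

# One taxon name followed by its "(score);" suffix; only the name is captured.
NAME_WITH_SCORE = re.compile(r'(\w+)\(\d+\);')

def clean_line(line):
    """Rebuild 'ID<TAB>Name;Name;...;' from one line of the table."""
    # Split on the first run of whitespace: ID on the left, taxonomy on the right.
    otu_id, taxonomy = line.split(None, 1)
    names = NAME_WITH_SCORE.findall(taxonomy)
    # Re-append ';' after every name so the trailing separator is preserved.
    return otu_id + '\t' + ''.join(name + ';' for name in names)

print(clean_line('Otu00467  Bacteria(100);Gracilibacteria(99);unclassified(99);'))
```

Since `\w+` matches only letters, digits and underscores, taxon names containing spaces or hyphens would need a wider character class.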

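Answer 3: (score: 1)

This regex removes the parenthesized groups of digits. It also removes any '>' characters, since it looks like you want those gone as well.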

import re

data = '''\
> Otu00467  Bacteria(100);Gracilibacteria(99);unclassified(99);unclassified(99);unclassified(99);unclassified(99);
> Otu00469  Bacteria(100);Proteobacteria(96);unclassified(96);unclassified(96);unclassified(96);unclassified(96);
> Otu00470  Bacteria(100);Proteobacteria(100);Alphaproteobacteria(100);Rhodospirillales(100);Rhodospirillaceae(100);Azospirillum(54);
'''

data = re.sub(r'>|\(\d+\)', '', data)
print(data)

Output:

 Otu00467  Bacteria;Gracilibacteria;unclassified;unclassified;unclassified;unclassified;
 Otu00469  Bacteria;Proteobacteria;unclassified;unclassified;unclassified;unclassified;
 Otu00470  Bacteria;Proteobacteria;Alphaproteobacteria;Rhodospirillales;Rhodospirillaceae;Azospirillum;

This code works on both Python 2 and Python 3.

Answer 4: (score: 0)

#Use re module to use regex
import re

#Open file and read data in data variable
data = open('file.txt').read()

#Apply search and replace on data variable
data = re.sub(r'\(\d+\)', '', data)

#Print data to output.txt file
with open('output.txt', 'w') as out:
    out.write(data)
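
If the file is large, or this step has to sit inside a pipeline, the same substitution can be applied one line at a time instead of reading everything into memory. This is only a sketch: the names `SCORE` and `strip_scores` are hypothetical, and the sample line stands in for the real file.

```python
import re

# Precompile the pattern once: a run of digits wrapped in parentheses.
SCORE = re.compile(r'\(\d+\)')

def strip_scores(lines):
    """Yield each line with its '(digits)' groups removed."""
    for line in lines:
        yield SCORE.sub('', line)

# Any iterable of lines works, including an open file object.
sample = ['Otu00470\tBacteria(100);Azospirillum(54);\n']
cleaned = list(strip_scores(sample))
print(cleaned)
```

With real files, the generator can be passed straight to `out.writelines(strip_scores(src))`, where `src` and `out` are the opened input and output files.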