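I am looking for alternative ways to clean a tabular file that contains information in parentheses. This will be the first step of a pipeline, and I need to remove every value inside parentheses (including the parentheses themselves).
What I have: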
> Otu00467 Bacteria(100);Gracilibacteria(99);unclassified(99);unclassified(99);unclassified(99);unclassified(99);
> Otu00469 Bacteria(100);Proteobacteria(96);unclassified(96);unclassified(96);unclassified(96);unclassified(96);
> Otu00470 Bacteria(100);Proteobacteria(100);Alphaproteobacteria(100);Rhodospirillales(100);Rhodospirillaceae(100);Azospirillum(54);
What I want:
Otu00467 Bacteria;Gracilibacteria;unclassified;unclassified;unclassified;unclassified;
Otu00469 Bacteria;Proteobacteria;unclassified;unclassified;unclassified;unclassified;
Otu00470 Bacteria;Proteobacteria;Alphaproteobacteria;Rhodospirillales;Rhodospirillaceae;Azospirillum;
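My first approach was to split the second column on ";", "(" and ")" and then join everything back together. It works, but it is ugly.
Thanks.
Answer 0 (score: 1)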
import re
new_string = re.sub(r'\(.*?\)', '', your_string)
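To see what that one-liner does, here is a minimal sketch applied to a shortened sample line from the question. The helper name strip_parens is my own and only there for illustration; the pattern is the one shown above.

```python
import re

# Hypothetical helper name; the pattern is the one from the answer above.
def strip_parens(text):
    """Remove every '(...)' group, parentheses included."""
    return re.sub(r'\(.*?\)', '', text)

line = "Otu00467 Bacteria(100);Gracilibacteria(99);unclassified(99);"
print(strip_parens(line))  # Otu00467 Bacteria;Gracilibacteria;unclassified;
```

Note that this pattern removes anything between parentheses, not only numbers, which matches what the question asks for.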
Answer 1 (score: 1)
You can do this easily with a regular expression.
import re
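with open('file.txt') as f:
    text = f.read()
text = re.sub(r'\(.*?\)', '', text, flags=re.M)
The re.M flag is the multiline specifier. It only changes where ^ and $ match, so it has no effect on this pattern and can be omitted. The lazy .*? does not cross a line break anyway, because . does not match a newline by default.
This code removes every occurrence of (..).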
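The ? after .* is what makes this work on lines with several groups. A small sketch of the difference, using a shortened sample line from the question (the variable names are only illustrative):

```python
import re

line = "Otu00467 Bacteria(100);Gracilibacteria(99);"

# Lazy: each match stops at the first ')' it finds.
lazy = re.sub(r'\(.*?\)', '', line)
# Greedy: one match runs from the first '(' to the last ')' on the line.
greedy = re.sub(r'\(.*\)', '', line)

print(lazy)    # Otu00467 Bacteria;Gracilibacteria;
print(greedy)  # Otu00467 Bacteria;
```

With the greedy version everything between the first and last parenthesis is lost, so the lazy form (or the stricter \(\d+\)) is the one to use here.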
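Answer 2 (score: 1)
I would try a regexp. Something like this:
import re
pattern = re.compile(r'(\w+)\(\d+\);')
# findall returns only the names; join() drops the final ';', so add it back
';'.join(pattern.findall(string)) + ';'
Do this for each string, where string is the taxonomy column of one line.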
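Since findall only returns the captured names, the OTU identifier has to be kept separately. A rough sketch of how that could look for a whole line, assuming the identifier and the taxonomy are separated by whitespace (the function name clean_line is made up for this example):

```python
import re

PATTERN = re.compile(r'(\w+)\(\d+\);')

def clean_line(line):
    """Return 'ID name;name;...;' for one 'ID name(n);name(n);...' line."""
    # Assumption: the first whitespace separates the ID from the taxonomy.
    otu_id, taxonomy = line.split(None, 1)
    names = PATTERN.findall(taxonomy)
    return otu_id + ' ' + ';'.join(names) + ';'

sample = ("Otu00470 Bacteria(100);Proteobacteria(100);Alphaproteobacteria(100);"
          "Rhodospirillales(100);Rhodospirillaceae(100);Azospirillum(54);")
print(clean_line(sample))
```

One caveat: \w+ does not match hyphens or spaces, so a name such as Escherichia-Shigella would be cut down to Shigella. If your taxonomy contains such names, a substitution-based approach is safer.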
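Answer 3 (score: 1)
This regular expression removes the parenthesized groups of digits. It also removes any '>' characters, since it seems you want to get rid of those too.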
import re
data = '''\
> Otu00467 Bacteria(100);Gracilibacteria(99);unclassified(99);unclassified(99);unclassified(99);unclassified(99);
> Otu00469 Bacteria(100);Proteobacteria(96);unclassified(96);unclassified(96);unclassified(96);unclassified(96);
> Otu00470 Bacteria(100);Proteobacteria(100);Alphaproteobacteria(100);Rhodospirillales(100);Rhodospirillaceae(100);Azospirillum(54);
'''
# Remove each '>' (with the space after it, if any) and each parenthesized number
data = re.sub(r'> ?|\(\d+\)', '', data)
print(data)
Output
Otu00467 Bacteria;Gracilibacteria;unclassified;unclassified;unclassified;unclassified;
Otu00469 Bacteria;Proteobacteria;unclassified;unclassified;unclassified;unclassified;
Otu00470 Bacteria;Proteobacteria;Alphaproteobacteria;Rhodospirillales;Rhodospirillaceae;Azospirillum;
This code works on both Python 2 and 3.
Answer 4 (score: 0)
# Use the re module for regular expressions
import re
# Open the input file and read its contents into data
with open('file.txt') as f:
    data = f.read()
# Remove every parenthesized number from data
data = re.sub(r'\(\d+\)', '', data)
# Write the result to output.txt
with open('output.txt', 'w') as out:
    out.write(data)
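Since the question mentions that this is the first step of a pipeline, a line-by-line variant may be handy for large files. This is only a sketch: the names clean_stream and PAREN_NUMBER are my own, and the demonstration uses in-memory streams instead of real files.

```python
import io
import re

# Same pattern as above, compiled once.
PAREN_NUMBER = re.compile(r'\(\d+\)')

def clean_stream(src, dst):
    """Copy src to dst line by line, dropping every parenthesized number."""
    for line in src:
        dst.write(PAREN_NUMBER.sub('', line))

# Small in-memory demonstration with one line from the question.
src = io.StringIO("Otu00469 Bacteria(100);Proteobacteria(96);\n")
dst = io.StringIO()
clean_stream(src, dst)
print(dst.getvalue(), end='')  # Otu00469 Bacteria;Proteobacteria;
```

In a shell pipeline you could call clean_stream(sys.stdin, sys.stdout) after importing sys, so the script reads from the previous step and writes to the next one without loading the whole file into memory.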