I have a huge file that looks like this:
CAV-1 ATCTACTTCTATCG
CAV-2 GCGCGTAGCTAGCT
CAV-2 AAGCGCTCGTAAAA
CAV-3 AAATATATATATCC
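Using Python, I want to remove the lines whose identifier string is repeated, in this case "CAV-2". The first line with that string should be kept. I would get this: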
CAV-1 ATCTACTTCTATCG
CAV-2 GCGCGTAGCTAGCT
CAV-3 AAATATATATATCC
I know how to use regular expressions and parse a file line by line, but I have not been able to work out this particular task.
Answer 0 (score: 3)
Just use a dictionary:
In [1]: lines = '''CAV-1 ATCTACTTCTATCG
...: CAV-2 GCGCGTAGCTAGCT
...: CAV-2 AAGCGCTCGTAAAA
...: CAV-3 AAATATATATATCC'''
In [2]: lines
Out[2]: 'CAV-1 ATCTACTTCTATCG\nCAV-2 GCGCGTAGCTAGCT\nCAV-2 AAGCGCTCGTAAAA\nCAV-3 AAATATATATATCC'
In [3]: res = {}
In [4]: for line in lines.split("\n"):
...:     res[line.split(" ")[0]] = line.split(" ")[1]
...:
In [5]: res
Out[5]:
{'CAV-1': 'ATCTACTTCTATCG',
'CAV-2': 'AAGCGCTCGTAAAA',
'CAV-3': 'AAATATATATATCC'}
In [6]: '\n'.join(['%s %s' % (key, value) for (key, value) in res.items()])
Out[6]: 'CAV-1 ATCTACTTCTATCG\nCAV-2 AAGCGCTCGTAAAA\nCAV-3 AAATATATATATCC'
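Note that this keeps the last line for each key, because later assignments overwrite earlier ones. If you want to keep the first line instead, you can use a dictionary of lists and then output the first element of each list.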
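A minimal sketch of keeping the first line per key, assuming every line is an ID and a sequence separated by a single space, as in the sample. Rather than a dictionary of lists, it uses `dict.setdefault`, which stores a value only when the key is not yet present; that is a swapped-in simplification. The name `first_seen` is hypothetical.

```python
lines = '''CAV-1 ATCTACTTCTATCG
CAV-2 GCGCGTAGCTAGCT
CAV-2 AAGCGCTCGTAAAA
CAV-3 AAATATATATATCC'''

first_seen = {}  # hypothetical name: maps each ID to the first sequence seen for it
for line in lines.split("\n"):
    key, value = line.split(" ", 1)
    # setdefault stores the value only if the key is not already in the dict,
    # so later duplicates of the same ID are ignored.
    first_seen.setdefault(key, value)

result = '\n'.join('%s %s' % (key, value) for key, value in first_seen.items())
print(result)
```

The output order follows the input order only because dictionaries preserve insertion order, which the language guarantees from Python 3.7 onward.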
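Answer 1 (score: 1)
You need to use capture groups, like this.
Regex: ((CAV-\d+\s)[AGCT]+)(?:\n\2[AGCT]+)*
Explanation
((CAV-\d+\s)[AGCT]+)
Matches one line of your pattern and captures the whole match in the first group. The sub-match CAV-\d+\s is captured in the second group.
(?:\n\2[AGCT]+)*
Matches any number of immediately following lines that begin with the same CAV-\d+\s sub-match.
Finally, replace the whole match with the first captured group, i.e. the first line of the run.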
Python code (tested in Python 3.5.2)
import re
# Path to the input file with the sequence data. Use your own file path.
new1 = 'C:\\Users\\acer\\Desktop\\new1.txt'
# Path to the output file for the deduplicated data. Use your own file path.
new2 = 'C:\\Users\\acer\\Desktop\\new2.txt'
# Read the whole original file into a single string.
with open(new1, 'r') as fp1:
    lines = fp1.read()
# Regex substitution on the full text. Each run of consecutive lines
# sharing the same ID is replaced by the first line of that run.
lines = re.sub(r'((CAV-\d+\s)[AGCT]+)(?:\n\2[AGCT]+)*', r'\1', lines)
# Write the deduplicated data to the new file.
with open(new2, 'w') as fp2:
    fp2.write(lines)
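To see what the substitution does without touching any files, here is a small sketch that applies the same pattern to in-memory strings. The variable names and the second input (`scattered`) are made up for illustration; it is included to show that the pattern only collapses duplicates that sit on consecutive lines.

```python
import re

pattern = r'((CAV-\d+\s)[AGCT]+)(?:\n\2[AGCT]+)*'

# Sample from the question: the duplicate CAV-2 lines are adjacent.
adjacent = ('CAV-1 ATCTACTTCTATCG\n'
            'CAV-2 GCGCGTAGCTAGCT\n'
            'CAV-2 AAGCGCTCGTAAAA\n'
            'CAV-3 AAATATATATATCC')
deduped = re.sub(pattern, r'\1', adjacent)
print(deduped)

# Hypothetical input: the duplicate CAV-2 line is separated by another ID,
# so the backreference \2 never matches on the following line.
scattered = ('CAV-2 GCGCGTAGCTAGCT\n'
             'CAV-3 AAATATATATATCC\n'
             'CAV-2 AAGCGCTCGTAAAA')
unchanged = re.sub(pattern, r'\1', scattered)
print(unchanged == scattered)
```

If duplicates can appear anywhere in the file rather than only on neighbouring lines, a dictionary or set keyed on the ID is likely the safer choice.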
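Answer 2 (score: 1)
As other users have pointed out, a regular expression is not the best way to solve this problem. You can use a dictionary and then drop the duplicates: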
from collections import defaultdict
d = defaultdict(list)
s = ["CAV-1 ATCTACTTCTATCG", "CAV-2 GCGCGTAGCTAGCT", "CAV-2 AAGCGCTCGTAAAA", "CAV-3 AAATATATATATCC"]
# Group every sequence under its name, in order of appearance.
for name, sequence in [i.split() for i in s]:
    d[name].append(sequence)
# Keep only the first sequence recorded for each name.
final_output = [' '.join([a, b[0]]) for a, b in d.items()]
Output:
['CAV-1 ATCTACTTCTATCG', 'CAV-2 GCGCGTAGCTAGCT', 'CAV-3 AAATATATATATCC']
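Since the question mentions a huge file, here is a rough sketch of a variant that reads the input line by line and keeps only the set of IDs seen so far in memory, not every sequence. The function name `dedupe_lines` is hypothetical, and `io.StringIO` stands in for real file objects so the example is self-contained.

```python
import io

def dedupe_lines(src, dst):
    """Copy lines from src to dst, keeping only the first line for each ID."""
    seen = set()
    for line in src:
        if not line.strip():
            continue  # skip blank lines
        name = line.split(None, 1)[0]  # the ID is the first whitespace-separated field
        if name not in seen:
            seen.add(name)
            dst.write(line)

# io.StringIO is used here in place of files opened with open(...).
src = io.StringIO(
    "CAV-1 ATCTACTTCTATCG\n"
    "CAV-2 GCGCGTAGCTAGCT\n"
    "CAV-2 AAGCGCTCGTAAAA\n"
    "CAV-3 AAATATATATATCC\n"
)
dst = io.StringIO()
dedupe_lines(src, dst)
print(dst.getvalue())
```

With real files, the same function could be called inside a `with open(...) as src, open(..., 'w') as dst:` block; memory use then grows with the number of distinct IDs, not with the file size.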