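I have a text file with 3 columns and want to filter it based on the 3rd column.
Column 1 holds ids and column 3 holds character sequences. Each id is repeated in column 1, and each repeat has a different sequence of a different length in column 3. In some cases there is no sequence, and it is replaced with "not present".
RPL17 ENST00000584364 not present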
RPL17 ENST00000579248 CTGCGTTGCTCCGAGGGCCCAATCCTCCTGCCATCGCCGCCATCCTGGCTTCGGGGGCGCCGGCCT
RPL17 ENST00000580210 GCCCGTGTGGCTACTTCTGTGGAAGCAGTGCTGTAGTTACTGGAAGATAAAAGGGAAAGCAAGCCCTTGGTGGGGGAAA
RPL18 ENST00000551749 not present
RPL18 ENST00000546623 not present
RPL18 ENST00000552588 TCTCTCTTTCCGGACCTGGCCGAGCAGGAGGCGCCATC
RPL18 ENST00000547897 ACCTGGCCGAGCAGGAGGCGCCATC
RPL18 ENST00000550645 GCCGAGCAGGAGGCGCCATC
RPL18 ENST00000552705 not present
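I want to keep only one row per id: a row that has a sequence, and that sequence must be the longest one for that id.
Example of the expected result: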
RPL17 ENST00000580210 GCCCGTGTGGCTACTTCTGTGGAAGCAGTGCTGTAGTTACTGGAAGATAAAAGGGAAAGCAAGCCCTTGGTGGGGGAAA
RPL18 ENST00000552588 TCTCTCTTTCCGGACCTGGCCGAGCAGGAGGCGCCATC
My code so far:
with open("file.txt") as f, open('test.txt', 'w') as outfile:
    for line in f:
        line=line.split(",")
        .
        .
        .
        outfile.writerow(entry)
I wrote this code and changed the middle part several times, but it never worked the way I wanted.
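Answer 0 (score: 0)
The input file looks like it is in a columnar format. First we have to work out which fields are in which columns; then we can use a dict to make sure we keep only the longest sequence for each ID.
Here is what I think you are asking for: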
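from collections import OrderedDict

# Each line holds three whitespace-separated columns, for example:
# RPL17 ENST00000584364 not present
# The third column may be the two-word placeholder "not present",
# so split at most twice to keep it in one piece.
sequences = OrderedDict()  # id -> (longest sequence so far, full line)
with open("file.txt") as f, open('test.txt', 'w') as outfile:
    for line in f:
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        st_id, sequence = fields[0], fields[2].strip()
        if sequence == 'not present':
            continue  # ids without any real sequence are dropped
        value, _ = sequences.get(st_id, ('', None))
        if len(sequence) > len(value):
            # normalise the line ending so the last input line is written correctly
            sequences[st_id] = (sequence, line.rstrip('\n') + '\n')
    for _, line in sequences.values():
        outfile.write(line)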
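As a quick check of the dict-based approach, here is a minimal sketch that applies the same idea to an in-memory list of lines instead of files. The function name `longest_per_id` is hypothetical (it does not appear in the question), and the sketch assumes the three fields are separated by whitespace, as in the sample shown.

```python
def longest_per_id(lines):
    """Return one line per id: the one with the longest real sequence."""
    best = {}  # id -> (sequence, line); dicts keep insertion order on Python 3.7+
    for line in lines:
        fields = line.split(None, 2)  # at most 3 fields, so "not present" stays whole
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        st_id, sequence = fields[0], fields[2].strip()
        if sequence == "not present":
            continue  # placeholder rows never win
        if st_id not in best or len(sequence) > len(best[st_id][0]):
            best[st_id] = (sequence, line.rstrip("\n"))
    return [line for _, line in best.values()]


# Sample rows taken from the question
sample = [
    "RPL17 ENST00000584364 not present",
    "RPL17 ENST00000579248 CTGCGTTGCTCCGAGGGCCCAATCCTCCTGCCATCGCCGCCATCCTGGCTTCGGGGGCGCCGGCCT",
    "RPL17 ENST00000580210 GCCCGTGTGGCTACTTCTGTGGAAGCAGTGCTGTAGTTACTGGAAGATAAAAGGGAAAGCAAGCCCTTGGTGGGGGAAA",
    "RPL18 ENST00000551749 not present",
    "RPL18 ENST00000552588 TCTCTCTTTCCGGACCTGGCCGAGCAGGAGGCGCCATC",
    "RPL18 ENST00000547897 ACCTGGCCGAGCAGGAGGCGCCATC",
]

result = longest_per_id(sample)
for line in result:
    print(line)
```

Keeping the logic in a function that takes any iterable of lines means the same code can be fed an open file object later.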
Answer 1 (score: 0)
from collections import defaultdict

d = defaultdict(list)
with open('you_data.txt') as f, open('out.txt', 'w') as out:
    for line in f:
        # Split once: the id, then "transcript sequence" as a single string
        parts = line.strip().split(' ', 1)
        if len(parts) == 2 and not parts[1].endswith('not present'):
            d[parts[0]].append(parts[1])
    # e.g. d['RPL18'] == ['ENST00000552588 TCTCTCTTTCCGGACCTGGCCGAGCAGGAGGCGCCATC', 'ENST00000547897 ACCTGGCCGAGCAGGAGGCGCCATC', 'ENST00000550645 GCCGAGCAGGAGGCGCCATC']
    for k, v in d.items():
        # Assumes all transcript ids have the same length (as in the sample),
        # so the longest string also carries the longest sequence
        long_v = max(v, key=len)
        out.write(' '.join([k, long_v]) + '\n')
Output (the order of the ids may differ depending on the Python version):
RPL18 ENST00000552588 TCTCTCTTTCCGGACCTGGCCGAGCAGGAGGCGCCATC
RPL17 ENST00000580210 GCCCGTGTGGCTACTTCTGTGGAAGCAGTGCTGTAGTTACTGGAAGATAAAAGGGAAAGCAAGCCCTTGGTGGGGGAAA
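Answer 2 (score: 0)
I'm fairly sure this is what you want, though it could certainly be cleaned up a bit. max combined with itemgetter returns the row with the longest sequence; since it does this for each id, it should be what you want, and it is likely faster than a full sort because max makes a single pass over the rows.
I used a comma as the delimiter because you said the data is comma-separated, even though what you showed us is space-separated; you can change it to whatever the delimiter really is. The output is also comma-separated, but you can change that to whatever the output delimiter should be.
Update: the last line of the previous version did not build the row correctly, and I was not resetting lines to empty after writing a row, so it would not have worked properly. Also, since I would otherwise be repeating code, I moved the important line into a function (make_row).
I tested it with comma-separated data and it works well.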
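from operator import itemgetter
import csv

get_sequence = itemgetter(2)  # the sequence is the third field

def make_row(lines):
    # Pick the row with the longest sequence, then strip whitespace from each field
    longest = max(lines, key=lambda row: len(get_sequence(row).strip()))
    return [field.strip() for field in longest]

with open("file.txt") as f, open('test.txt', 'w', newline='') as outfile:
    output = csv.writer(outfile)
    current_id = ''
    lines = []
    for line in f:
        current_line = line.split(",")
        if len(current_line) < 3:
            continue  # skip blank or malformed lines
        if current_line[0] != current_id and lines:
            output.writerow(make_row(lines))
            lines = []
        current_id = current_line[0]
        if current_line[2].strip() != 'not present':
            lines.append(current_line)
    if lines:  # to catch the last row
        output.writerow(make_row(lines))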
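Since the sample in the question is separated by spaces rather than commas, the grouping logic above can also be sketched with `itertools.groupby` on whitespace-split rows. This is only an illustration under assumptions: `pick_longest` is a hypothetical helper name, and like the loop above it assumes that all rows for one id are adjacent in the input.

```python
from itertools import groupby


def pick_longest(lines):
    """For each run of rows sharing an id, keep the row with the longest sequence."""
    # Split into at most 3 fields so the placeholder "not present" stays in one field
    rows = (line.split(None, 2) for line in lines)
    rows = [[field.strip() for field in row] for row in rows if len(row) == 3]
    result = []
    for _, group in groupby(rows, key=lambda row: row[0]):
        candidates = [row for row in group if row[2] != "not present"]
        if candidates:  # ids with no real sequence are skipped
            result.append(max(candidates, key=lambda row: len(row[2])))
    return result


rows = pick_longest([
    "RPL18 ENST00000551749 not present\n",
    "RPL18 ENST00000552588 TCTCTCTTTCCGGACCTGGCCGAGCAGGAGGCGCCATC\n",
    "RPL18 ENST00000550645 GCCGAGCAGGAGGCGCCATC\n",
])
for row in rows:
    print(" ".join(row))
```

If the ids are not already grouped together in the file, the rows would need to be sorted by id first, or collected in a dict instead.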