Filtering based on multiple factors in Python

Time: 2016-12-15 13:19:06

Tags: python

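I have a text file with three columns and want to filter it by the third column. Column 1 holds an id and column 3 holds a character sequence. Each id is repeated over several rows in column 1, and each repeat has a different sequence of a different length in column 3. In some rows there is no sequence, so it is replaced with "not present". Here is a small example:

RPL17   ENST00000584364 not present
RPL17   ENST00000579248 CTGCGTTGCTCCGAGGGCCCAATCCTCCTGCCATCGCCGCCATCCTGGCTTCGGGGGCGCCGGCCT
RPL17   ENST00000580210 GCCCGTGTGGCTACTTCTGTGGAAGCAGTGCTGTAGTTACTGGAAGATAAAAGGGAAAGCAAGCCCTTGGTGGGGGAAA
RPL18   ENST00000551749 not present
RPL18   ENST00000546623 not present
RPL18   ENST00000552588 TCTCTCTTTCCGGACCTGGCCGAGCAGGAGGCGCCATC
RPL18   ENST00000547897 ACCTGGCCGAGCAGGAGGCGCCATC
RPL18   ENST00000550645 GCCGAGCAGGAGGCGCCATC
RPL18   ENST00000552705 not present

I want to keep only one row per id, it must be a row that has a sequence, and that sequence must be the longest one for the id.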

Expected output:

RPL17   ENST00000580210 GCCCGTGTGGCTACTTCTGTGGAAGCAGTGCTGTAGTTACTGGAAGATAAAAGGGAAAGCAAGCCCTTGGTGGGGGAAA
RPL18   ENST00000552588 TCTCTCTTTCCGGACCTGGCCGAGCAGGAGGCGCCATC

My attempt:

with open("file.txt") as f, open('test.txt', 'w') as outfile:
    for line in f:
        line=line.split(",")
           .
           .
           .
           outfile.writerow(entry)

I wrote this code and changed the middle part several times, but it never worked the way I wanted.

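3 Answers:

Answer 0: (score: 0)

The input file looks like a fixed-width column format. First we have to work out which fields sit in which columns; then we can use a dict to make sure we keep only the longest sequence for a given ID.

Here is what I think you are asking for: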

# 00000000001111111111222222222233333333334
# 01234567890123456789012345678901234567890
# RPL17   ENST00000584364 not present
from collections import OrderedDict

sequences = OrderedDict()  # id -> (longest sequence so far, original line)
with open("file.txt") as f, open('test.txt', 'w') as outfile:
    for line in f:
        st_id = line[:8].strip()      # columns 0-7: id
        sequence = line[24:].strip()  # columns 24 onward: sequence
        if not st_id or sequence == 'not present':
            continue  # skip blank lines and rows without a sequence
        value, _ = sequences.get(st_id, ('', None))
        if len(sequence) > len(value):
            sequences[st_id] = (sequence, line)
    for _, line in sequences.values():
        # normalise the line ending in case the last input line had none
        outfile.write(line.rstrip('\n') + '\n')
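
If the column offsets in the real file turn out to differ from the ruler above, the same dict idea can be sketched with a whitespace split instead of fixed positions. This is only a sketch under assumptions: the function name longest_per_id is made up for illustration, the id and transcript fields are assumed to contain no spaces, and it relies on plain dicts keeping insertion order (Python 3.7+).

```python
def longest_per_id(lines):
    """Return one line per id: the one whose sequence is longest."""
    best = {}  # id -> (sequence, line); insertion order is kept on Python 3.7+
    for line in lines:
        parts = line.split(None, 2)  # id, transcript, rest of the line
        if len(parts) < 3:
            continue  # blank or malformed line
        st_id, _, sequence = parts
        sequence = sequence.strip()
        if sequence == "not present":
            continue  # rows without a sequence never win
        if st_id not in best or len(sequence) > len(best[st_id][0]):
            best[st_id] = (sequence, line.rstrip("\n"))
    return [line for _, line in best.values()]


# Rows taken from the question's example data.
sample = [
    "RPL17   ENST00000584364 not present\n",
    "RPL17   ENST00000579248 CTGCGTTGCTCCGAGGGCCCAATCCTCCTGCCATCGCCGCCATCCTGGCTTCGGGGGCGCCGGCCT\n",
    "RPL17   ENST00000580210 GCCCGTGTGGCTACTTCTGTGGAAGCAGTGCTGTAGTTACTGGAAGATAAAAGGGAAAGCAAGCCCTTGGTGGGGGAAA\n",
    "RPL18   ENST00000551749 not present\n",
    "RPL18   ENST00000552588 TCTCTCTTTCCGGACCTGGCCGAGCAGGAGGCGCCATC\n",
    "RPL18   ENST00000547897 ACCTGGCCGAGCAGGAGGCGCCATC\n",
    "RPL18   ENST00000552705 not present\n",
]

for row in longest_per_id(sample):
    print(row)
```

Because the function takes any iterable of lines, an open file object can be passed in directly, and an id whose rows are all "not present" simply produces no output row.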

Answer 1: (score: 0)

from collections import defaultdict

d = defaultdict(list)  # id -> list of "transcript sequence" strings
with open('you_data.txt') as f, open('out.txt', 'w') as out:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        # split only on the first three-space gap: the id, then the rest of the row
        k, v = line.rstrip('\n').split('   ', 1)
        if not v.endswith('not present'):
            d[k].append(v)
    # d['RPL18'] is now ['ENST00000552588 TCTCTCTTTCCGGACCTGGCCGAGCAGGAGGCGCCATC', 'ENST00000547897 ACCTGGCCGAGCAGGAGGCGCCATC', 'ENST00000550645 GCCGAGCAGGAGGCGCCATC']
    for k, v in d.items():
        # the transcript ids in the sample all have the same length,
        # so the longest string is the one with the longest sequence
        long_v = max(v, key=len)
        out.write('   '.join([k, long_v]) + '\n')

Output:

RPL18   ENST00000552588 TCTCTCTTTCCGGACCTGGCCGAGCAGGAGGCGCCATC
RPL17   ENST00000580210 GCCCGTGTGGCTACTTCTGTGGAAGCAGTGCTGTAGTTACTGGAAGATAAAAGGGAAAGCAAGCCCTTGGTGGGGGAAA

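Answer 2: (score: 0)

I'm fairly sure this is what you want, though I'm sure it could be tidied up. Calling max with a key on the third field returns the row with the longest sequence; since this is done for each id, it should give what you want, and it is probably faster than sorting.

I used a comma as the delimiter because you said the data is comma-separated, although what you showed us is space-separated; change the delimiter to whatever your file uses. I also write the output comma-separated, and you can change that to whatever the output delimiter should be.

Update: the final line did not set the row up correctly, and I was not resetting lines to empty after writing a row, so it would not have worked properly. Also, since I would otherwise be repeating code, I moved the important line into a function (make_row).

I tested it with comma-separated data and it worked well.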

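import csv


def make_row(lines):
    # pick the row whose third field (the sequence) is longest, not the
    # alphabetically greatest one, and strip whitespace from every field
    return [field.strip() for field in max(lines, key=lambda row: len(row[2]))]

with open("file.txt") as f, open('test.txt', 'w', newline='') as outfile:
    output = csv.writer(outfile)
    current_id = ''
    lines = []  # rows that have a sequence for the id currently being read
    for line in f:
        current_line = line.split(",")
        if current_line[0] != current_id and lines:
            output.writerow(make_row(lines))
            lines = []
        current_id = current_line[0]
        if current_line[2].strip() != 'not present':
            lines.append(current_line)
    if lines:  # write the last id, if it had any sequence at all
        output.writerow(make_row(lines))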
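
The same grouping of consecutive rows can also be sketched with itertools.groupby, reading through csv.reader so the delimiter is a single argument that can be switched from a comma to whatever the file really uses. Treat this as a rough sketch: the name longest_rows is made up for illustration, the in-memory sample stands in for the real file, and like the loop above it assumes that all rows for one id are adjacent.

```python
import csv
import io
from itertools import groupby


def longest_rows(rows):
    """For each run of consecutive rows sharing an id, yield the row with the longest sequence."""
    usable = (row for row in rows if len(row) >= 3)  # ignore blank or short rows
    for _, group in groupby(usable, key=lambda row: row[0]):
        with_seq = [row for row in group if row[2].strip() != "not present"]
        if with_seq:  # ids with no sequence at all are dropped
            longest = max(with_seq, key=lambda row: len(row[2].strip()))
            yield [field.strip() for field in longest]


# Comma-separated version of the question's rows, held in memory for the demo.
data = io.StringIO(
    "RPL17,ENST00000584364,not present\n"
    "RPL17,ENST00000579248,CTGCGTTGCTCCGAGGGCCCAATCCTCCTGCCATCGCCGCCATCCTGGCTTCGGGGGCGCCGGCCT\n"
    "RPL17,ENST00000580210,GCCCGTGTGGCTACTTCTGTGGAAGCAGTGCTGTAGTTACTGGAAGATAAAAGGGAAAGCAAGCCCTTGGTGGGGGAAA\n"
    "RPL18,ENST00000551749,not present\n"
    "RPL18,ENST00000552588,TCTCTCTTTCCGGACCTGGCCGAGCAGGAGGCGCCATC\n"
    "RPL18,ENST00000547897,ACCTGGCCGAGCAGGAGGCGCCATC\n"
    "RPL18,ENST00000552705,not present\n"
)

result = list(longest_rows(csv.reader(data, delimiter=",")))
for row in result:
    print(",".join(row))
```

groupby only merges neighbouring rows, so if the ids are not already grouped in the file, the rows would need to be sorted by id first.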