How to compare multiple lines in a file and output a combined entry

Date: 2014-01-29 18:01:00

Tags: python perl unix awk bioinformatics

I have this file with four columns:

chr start end transcript

Like this:

chrI    128980  129130  F53G12.5b  
chrI    132280  132430  F53G12.5c.2  
chrI    132280  132430  F53G12.5a  
chrI    132280  132430  F53G12.5b  
chrI    132280  132430  F53G12.5c.1  
chrI    133600  133750  F53G12.5c.2  
chrI    133600  133750  F53G12.5a  
chrI    133600  133750  F53G12.5b  
chrI    133600  133750  F53G12.5c.1  
chrI    136240  136390  F53G12.4  
chrI    139100  139250  F53G12.3  
chrI    163220  163370  F56C11.2a  
chrI    163220  163370  F56C11.2b  
chrI    173900  174050  F56C11.6a  
chrI    173900  174050  F56C11.6b  
chrI    173900  174050  F56C11.6c  
chrI    182240  182390  F56C11.3  
chrI    184080  184230  Y48G1BL.2a  
chrI    190720  190870  Y48G1BL.2a  

Many regions (described by chr, start, end) are repeated because they map to more than one transcript.

For example:

chrI    133600  133750  F53G12.5c.2  
chrI    133600  133750  F53G12.5a  
chrI    133600  133750  F53G12.5b  
chrI    133600  133750  F53G12.5c.1  

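What I want is code that takes the rows in which columns 1, 2, and 3 are identical, extracts the common part of column 4 from them (that is, the prefix shared by all of the names, here F53G12.5), and outputs a single condensed entry, i.e.: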

chrI    133600  133750  F53G12.5

Or, for example:

chrI    83280   83430   Y48G1C.10a  
chrI    90420   90570   Y48G1C.10b  
chrI    90420   90570   Y48G1C.10c  
chrI    90420   90570   Y48G1C.10a  

should give

 chrI    83280   83430   Y48G1C.10a  
 chrI    90420   90570   Y48G1C.10  

Do you have any suggestions? Thanks a lot.

7 answers:

Answer 0: (score: 1)

I suspect this could be done with Pandas, and better than this, but I'm not very familiar with Pandas yet, so... submitted without debugging.

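from collections import defaultdict

def longest_identical_substring(words):
    # Try prefixes of the first word, longest first, and return the
    # first one that every word starts with.
    for idx in range(len(words[0]), 0, -1):
        prefix = words[0][:idx]
        if all(w.startswith(prefix) for w in words):
            return prefix
    return ""

transcripts = defaultdict(list)
with open('myfile.csv') as infile:
    for line in infile:
        row = line.split()  # whitespace-separated columns
        if len(row) < 4:
            continue  # skip blank or malformed lines
        # a list is unhashable, so use a tuple as the dict key
        transcripts[tuple(row[:3])].append(row[3])
for (chrom, start, end), ts in transcripts.items():
    print(chrom, start, end, longest_identical_substring(ts))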
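As a variation on the same idea, the standard library already has a character-wise prefix function, os.path.commonprefix, so the helper does not have to be hand-written. A minimal sketch, assuming whitespace-separated input; the name `condense` and the sample rows (taken from the question's second example) are only illustrative:

```python
import os.path
from collections import defaultdict

def condense(lines):
    """Group rows on (chr, start, end) and reduce column 4 to its common prefix."""
    groups = defaultdict(list)          # dicts keep insertion order in Python 3.7+
    for line in lines:
        fields = line.split()
        if len(fields) < 4:             # skip blank or malformed lines
            continue
        groups[tuple(fields[:3])].append(fields[3])
    # os.path.commonprefix compares character by character, despite its name
    return [key + (os.path.commonprefix(names),) for key, names in groups.items()]

sample = [
    "chrI    83280   83430   Y48G1C.10a",
    "chrI    90420   90570   Y48G1C.10b",
    "chrI    90420   90570   Y48G1C.10c",
    "chrI    90420   90570   Y48G1C.10a",
]
for row in condense(sample):
    print("\t".join(row))
```

On the sample this keeps Y48G1C.10a for the region that appears once and reduces the other region to Y48G1C.10, which is what the question asks for.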

Answer 1: (score: 1)

One way with awk. You can pipe the output to sort if needed.

Contents of script.awk:

{ key = $1" "$2" "$3 }

(key in a) {
    # Shrink the stored name to the prefix it shares with this line's name.
    word = ""
    t = (length($4) < length(a[key])) ? length($4) : length(a[key])
    for (x = 1; x <= t; x++) {
        if (substr($4, x, 1) != substr(a[key], x, 1)) break   # stop at the first mismatch
        word = word substr($4, x, 1)
    }
    a[key] = word
    next
}

{
    a[key] = $4
}

END {
    for (x in a) print x, a[x]
}

Your file:

$ cat file
chrI    128980  129130  F53G12.5b
chrI    132280  132430  F53G12.5c.2
chrI    132280  132430  F53G12.5a
chrI    132280  132430  F53G12.5b
chrI    132280  132430  F53G12.5c.1
chrI    133600  133750  F53G12.5c.2
chrI    133600  133750  F53G12.5a
chrI    133600  133750  F53G12.5b
chrI    133600  133750  F53G12.5c.1
chrI    136240  136390  F53G12.4
chrI    139100  139250  F53G12.3
chrI    163220  163370  F56C11.2a
chrI    163220  163370  F56C11.2b
chrI    173900  174050  F56C11.6a
chrI    173900  174050  F56C11.6b
chrI    173900  174050  F56C11.6c
chrI    182240  182390  F56C11.3
chrI    184080  184230  Y48G1BL.2a
chrI    190720  190870  Y48G1BL.2a

Output:

$ awk -f script.awk file
chrI 173900 174050 F56C11.6
chrI 128980 129130 F53G12.5b
chrI 182240 182390 F56C11.3
chrI 139100 139250 F53G12.3
chrI 136240 136390 F53G12.4
chrI 132280 132430 F53G12.5
chrI 163220 163370 F56C11.2
chrI 184080 184230 Y48G1BL.2a
chrI 190720 190870 Y48G1BL.2a
chrI 133600 133750 F53G12.5

Answer 2: (score: 0)

Here is the result of last night's effort :-)

#!/bin/bash
sort file                                           |\
   awk '
      NR==1 {f123=$1" "$2" "$3;trans=$4;next}       # NR=1, i.e. first line

      {                                             # NR!=1, i.e. subsequent lines
         if(f123!=$1" "$2" "$3){                    # Fields 1-3 have changed
            printf "%s %s\n",f123,trans
            f123=$1" "$2" "$3;trans=$4
         }else{                                     # Fields 1-3 unchanged, do common transcript
            newtrans=$4
            x=length(newtrans)                      # Get shorter of the two transcript lengths
            if(length(trans)<x) x=length(trans)
            common=""                               # Copy common prefix
            for(c=1;c<=x;c++){
               if(substr(trans,c,1)!=substr(newtrans,c,1))break   # Stop at the first mismatch
               common=common""substr(trans,c,1)
            }
            trans=common
         }
      }
      END {if(NR)printf "%s %s\n",f123,trans}       # Flush the last group
   '

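Some notes... Basically the input file is sorted so that records with the same chr/start/end values end up next to each other. The result is then piped into awk. When the first line is read, fields (columns) 1 to 3 are joined and saved in the variable "f123". As each subsequent line is read, its first three columns are compared with the last three seen. If any part of the first three columns has changed, the previous group is output together with its transcript. If the first three columns have not changed, we have another transcript to process. The prefix common to the last transcript and the current one is then computed by copying letters until one differs, and that new transcript is kept for output when columns 1 to 3 next change. When we reach the last record, we may still be holding a transcript that has not been printed yet; if so, we output it.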
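The walk-through above can be sketched in Python as well. This is only an illustration of the same sort-then-stream idea under the assumption of whitespace-separated rows; `merge_sorted` and the sample rows are hypothetical names and data chosen for the example:

```python
def merge_sorted(rows):
    """Yield (key, prefix) pairs from rows already sorted on the first three fields."""
    current_key = None
    prefix = ""
    for chrom, start, end, name in rows:
        key = (chrom, start, end)
        if key != current_key:
            if current_key is not None:
                yield current_key, prefix      # key changed: emit the finished group
            current_key, prefix = key, name
        else:
            # keep only the leading characters shared with the new name
            n = 0
            while n < min(len(prefix), len(name)) and prefix[n] == name[n]:
                n += 1
            prefix = prefix[:n]
    if current_key is not None:
        yield current_key, prefix              # flush the last group

rows = sorted(line.split() for line in [
    "chrI 163220 163370 F56C11.2a",
    "chrI 133600 133750 F53G12.5c.2",
    "chrI 163220 163370 F56C11.2b",
    "chrI 133600 133750 F53G12.5a",
    "chrI 190720 190870 Y48G1BL.2a",
])
for key, prefix in merge_sorted(rows):
    print(*key, prefix)
```

The final flush after the loop matters: without it the last region in the file would never be printed, and a region with a single transcript must be printed with its own name rather than with a prefix left over from an earlier group.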

Answer 3: (score: 0)

Without all the debugging, here is a simple awk statement:

awk -F"." '{ trimmed=substr($2,1,1);print $1"."trimmed;}' test.txt 
chrI    128980  129130  F53G12.5
chrI    132280  132430  F53G12.5
chrI    132280  132430  F53G12.5
chrI    132280  132430  F53G12.5
chrI    132280  132430  F53G12.5
chrI    133600  133750  F53G12.5
chrI    133600  133750  F53G12.5
chrI    133600  133750  F53G12.5
chrI    133600  133750  F53G12.5
chrI    136240  136390  F53G12.4
chrI    139100  139250  F53G12.3
chrI    163220  163370  F56C11.2
chrI    163220  163370  F56C11.2
chrI    173900  174050  F56C11.6
chrI    173900  174050  F56C11.6
chrI    173900  174050  F56C11.6
chrI    182240  182390  F56C11.3
chrI    184080  184230  Y48G1BL.2
chrI    190720  190870  Y48G1BL.2


awk -F"." '{ trimmed=substr($2,1,1);print $1"."trimmed;}' test.txt |sort|uniq
chrI    128980  129130  F53G12.5
chrI    132280  132430  F53G12.5
chrI    133600  133750  F53G12.5
chrI    136240  136390  F53G12.4
chrI    139100  139250  F53G12.3
chrI    163220  163370  F56C11.2
chrI    173900  174050  F56C11.6
chrI    182240  182390  F56C11.3
chrI    184080  184230  Y48G1BL.2
chrI    190720  190870  Y48G1BL.2

Sorting by column: in -nrk, n means numeric, r means reverse, and k gives the column ids; in this case I passed 2 and 3.

awk -F"." '{ trimmed=substr($2,1,1);print $1"."trimmed;}' test.txt |sort -nrk2,3|uniq
chrI    190720  190870  Y48G1BL.2
chrI    184080  184230  Y48G1BL.2
chrI    182240  182390  F56C11.3
chrI    173900  174050  F56C11.6
chrI    163220  163370  F56C11.2
chrI    139100  139250  F53G12.3
chrI    136240  136390  F53G12.4
chrI    133600  133750  F53G12.5
chrI    132280  132430  F53G12.5
chrI    128980  129130  F53G12.5

Update, working on the column itself:

awk  '{ trimmed=$4; if( match($4, /[0-9a-zA-Z]+\.[0-9a-zA-Z]/)) {  trimmed=substr($4,RSTART,RLENGTH); } print $1"\t"$2"\t"$3"\t"trimmed;}' test.txt |sort|uniq
chrI    128980  129130  F53G12.5
chrI    132280  132430  F53G12.5
chrI    133600  133750  F53G12.5
chrI    136240  136390  F53G12.4
chrI    139100  139250  F53G12.3
chrI    163220  163370  F56C11.2
chrI    173900  174050  F56C11.6
chrI    182240  182390  F56C11.3
chrI    184080  184230  Y48G1BL.2
chrI    190720  190870  Y48G1BL.2
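Trimming by pattern is not quite the same operation as taking a common prefix: a region with a single transcript is trimmed too (the question's expected output keeps Y48G1C.10a for a region that appears once), and this pattern keeps only one character after the dot. A small Python sketch of that behaviour, with the same regular expression transcribed to the `re` module; the function name is illustrative, and returning the name unchanged when nothing matches is an assumption of the sketch:

```python
import re

# Same pattern as in the awk command above: name, a dot, then one character.
PATTERN = re.compile(r"[0-9a-zA-Z]+\.[0-9a-zA-Z]")

def trim_by_pattern(name):
    """Return the first match of PATTERN, or the name unchanged if nothing matches."""
    m = PATTERN.search(name)
    return m.group(0) if m else name

for name in ["F53G12.5c.2", "Y48G1BL.2a", "Y48G1C.10a"]:
    print(name, "->", trim_by_pattern(name))
# F53G12.5c.2 -> F53G12.5
# Y48G1BL.2a -> Y48G1BL.2
# Y48G1C.10a -> Y48G1C.1
```

The last line shows the limitation: a two-digit suffix such as .10 is cut to .1, so this shortcut only fits data where the part after the dot is a single character.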

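Answer 4: (score: 0)

I think Python's itertools.groupby and itertools.takewhile can handle the two parts of the problem: grouping the rows by the values in the first three columns, and trimming the fourth column down to its common prefix.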

import itertools
from operator import itemgetter

def combine(data):
    for group, group_lines in itertools.groupby(data, itemgetter(0,1,2)):
        names = [line[3] for line in group_lines]
        prefix = "".join(t[0] for t in itertools.takewhile(lambda x:len(set(x))==1,
                                                           zip(*names)))
        yield group + (prefix,)

Run it like this:

with open(filename) as f:
    for item in combine(line.split() for line in f):
        print("{:8}{:8}{:8}{}".format(*item))

Example run:

>>> data = """chrI    128980  129130  F53G12.5b
chrI    132280  132430  F53G12.5c.2
chrI    132280  132430  F53G12.5a
chrI    132280  132430  F53G12.5b
chrI    132280  132430  F53G12.5c.1
chrI    133600  133750  F53G12.5c.2
chrI    133600  133750  F53G12.5a
chrI    133600  133750  F53G12.5b
chrI    133600  133750  F53G12.5c.1
chrI    136240  136390  F53G12.4
chrI    139100  139250  F53G12.3
chrI    163220  163370  F56C11.2a
chrI    163220  163370  F56C11.2b
chrI    173900  174050  F56C11.6a
chrI    173900  174050  F56C11.6b
chrI    173900  174050  F56C11.6c
chrI    182240  182390  F56C11.3
chrI    184080  184230  Y48G1BL.2a
chrI    190720  190870  Y48G1BL.2a""".splitlines()
>>> for item in combine(line.split() for line in data):
        print("{:8}{:8}{:8}{}".format(*item))


chrI    128980  129130  F53G12.5b
chrI    132280  132430  F53G12.5
chrI    133600  133750  F53G12.5
chrI    136240  136390  F53G12.4
chrI    139100  139250  F53G12.3
chrI    163220  163370  F56C11.2
chrI    173900  174050  F56C11.6
chrI    182240  182390  F56C11.3
chrI    184080  184230  Y48G1BL.2a
chrI    190720  190870  Y48G1BL.2a
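One thing to keep in mind with this approach: itertools.groupby only merges rows that are adjacent, so it relies on the file already being ordered by region, as the sample is. A small sketch of the difference, using made-up row order; note that sorting these string fields is lexicographic, which is enough to bring equal keys together but is not a numeric sort:

```python
import itertools
from operator import itemgetter

rows = [
    ("chrI", "90420", "90570", "Y48G1C.10b"),
    ("chrI", "83280", "83430", "Y48G1C.10a"),
    ("chrI", "90420", "90570", "Y48G1C.10c"),
]
key = itemgetter(0, 1, 2)

# Without sorting, the two 90420 rows are split into separate groups.
unsorted_groups = [k for k, _ in itertools.groupby(rows, key)]
# Sorting on the same key first makes equal keys adjacent.
sorted_groups = [k for k, _ in itertools.groupby(sorted(rows, key=key), key)]

print(len(unsorted_groups))  # 3
print(len(sorted_groups))    # 2
```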

Answer 5: (score: 0)

Using awk:

awk  '{sub(/[^0-9]+/,"",$2);NF=2} !a[$1]++' FS=. OFS=. file

chrI    128980  129130  F53G12.5
chrI    132280  132430  F53G12.5
chrI    133600  133750  F53G12.5
chrI    136240  136390  F53G12.4
chrI    139100  139250  F53G12.3
chrI    163220  163370  F56C11.2
chrI    173900  174050  F56C11.6
chrI    182240  182390  F56C11.3
chrI    184080  184230  Y48G1BL.2
chrI    190720  190870  Y48G1BL.2

Answer 6: (score: 0)

You guys are awesome! Thanks for all the answers. Here is another solution, using Perl:

#!/usr/bin/env perl
#use Data::Dumper qw(Dumper);
use strict;
use warnings;


my $filename = $ARGV[0];
my @matrix;
my @transcripts;
my @transcript;
my %referenceTable;

my $count=0;
my $oldkey="";
my $key="";
my @keys;
my @key;
my %hash;
    open FILE,"< $filename" or die "can not open file\n";
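    while (my $line=<FILE>) {
      chomp $line;    # remove the trailing newline so it is not stored in the transcript name
      my ($chromosome, $start, $stop, $transcript) = split(' ', $line);    # split on any whitespace
      next unless defined $transcript;    # skip blank or short lines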
        $key = $chromosome . "SPACE" . $start . "SPACE" . $stop;
        if ($oldkey ne $key) {

            $count = 0;
            $oldkey = $key;
        }
        push @{$referenceTable{$key}}, $transcript;

        $count++;
     }
my $output;
my ($k, $v); #Not @v -- $v will contain string that will be a reference to an array
while (($k, $v) = each(%referenceTable)){
 my ($chromosome, $start, $stop) = split(/SPACE/, $k);
 print "chromosome start stop \: $chromosome\t $start\t $stop \t";
 print "Common prefix \: \t ";
 $output = getleastcommonprefix(@{$v});
  print $output . "\n";
}

#print Dumper \%referenceTable;


sub getleastcommonprefix {
    my @searcharray = @_;
    my $common      = $searcharray[0];
    foreach my $index (1 .. $#searcharray) {
        # Count the leading characters shared by $common and this name.
        my $other = $searcharray[$index];
        my $len   = 0;
        $len++ while $len < length($common)
                  && $len < length($other)
                  && substr($common, $len, 1) eq substr($other, $len, 1);
        $common = substr($common, 0, $len);
    } ## end foreach my $index (1 .. $#searcharray)
    return $common;
} ## end sub getleastcommonprefix

#print 'Common prefix for file $filename [' . getleastcommonprefix(@array_of_test_names) . "]\n";