I have a file with four columns:
chr start end transcript
Like this:
chrI 128980 129130 F53G12.5b
chrI 132280 132430 F53G12.5c.2
chrI 132280 132430 F53G12.5a
chrI 132280 132430 F53G12.5b
chrI 132280 132430 F53G12.5c.1
chrI 133600 133750 F53G12.5c.2
chrI 133600 133750 F53G12.5a
chrI 133600 133750 F53G12.5b
chrI 133600 133750 F53G12.5c.1
chrI 136240 136390 F53G12.4
chrI 139100 139250 F53G12.3
chrI 163220 163370 F56C11.2a
chrI 163220 163370 F56C11.2b
chrI 173900 174050 F56C11.6a
chrI 173900 174050 F56C11.6b
chrI 173900 174050 F56C11.6c
chrI 182240 182390 F56C11.3
chrI 184080 184230 Y48G1BL.2a
chrI 190720 190870 Y48G1BL.2a
Many regions (described by chr, start and end) are repeated because they map to more than one transcript.
For example:
chrI 133600 133750 F53G12.5c.2
chrI 133600 133750 F53G12.5a
chrI 133600 133750 F53G12.5b
chrI 133600 133750 F53G12.5c.1
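What I want is code that takes the rows where columns 1, 2 and 3 are identical, extracts the common leading part of column 4 (F53G12.5 in this example), and outputs a single condensed entry, i.e.: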
chrI 133600 133750 F53G12.5
Or, for example:
chrI 83280 83430 Y48G1C.10a
chrI 90420 90570 Y48G1C.10b
chrI 90420 90570 Y48G1C.10c
chrI 90420 90570 Y48G1C.10a
should give
chrI 83280 83430 Y48G1C.10a
chrI 90420 90570 Y48G1C.10
Do you have any suggestions? Thanks a lot.
Answer 0 (score: 1)
I suspect this can be done with Pandas, and better than this, but I'm not very familiar with Pandas yet, so... Submitted without debugging.
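import csv
from collections import defaultdict

def longest_identical_substring(words):
    # Try prefixes from longest to shortest and return the first one shared by all words.
    for idx in range(min(len(w) for w in words), 0, -1):
        substrings = [w[:idx] for w in words]
        if max(substrings) == min(substrings):
            return substrings[0]
    return ""

transcripts = defaultdict(list)
with open('myfile.csv') as infile:
    # Assumes single-space-separated columns; use delimiter='\t' for a tab-separated file.
    reader = csv.reader(infile, delimiter=' ')
    for row in reader:
        # A list cannot be a dict key, so the first three columns are stored as a tuple.
        transcripts[tuple(row[:3])].append(row[3])
for (chrom, start, end), ts in transcripts.items():
    print(chrom, start, end, longest_identical_substring(ts))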
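For comparison, here is a minimal sketch of the same grouping that leans on the standard library's os.path.commonprefix, which compares strings character by character even though it lives in os.path. The function name condense and the inline sample lines are illustrative assumptions, not part of the answer above, and whitespace-separated columns are assumed.

```python
import os.path
from collections import defaultdict


def condense(lines):
    """Group rows by (chr, start, end) and reduce the names to their common prefix."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        groups[tuple(fields[:3])].append(fields[3])
    # dicts keep insertion order, so regions come out in first-seen order
    return [key + (os.path.commonprefix(names),) for key, names in groups.items()]


sample = [
    "chrI 83280 83430 Y48G1C.10a",
    "chrI 90420 90570 Y48G1C.10b",
    "chrI 90420 90570 Y48G1C.10c",
    "chrI 90420 90570 Y48G1C.10a",
]
for row in condense(sample):
    print(*row)
```

A region that occurs only once keeps its full transcript name, which is what the second example in the question asks for.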
Answer 1 (score: 1)
One way with awk. You can pipe it to sort if needed.
script.awk
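{ key = $1" "$2" "$3 }
key in a {
    # Key seen before: shrink the stored name to the prefix it shares with $4.
    word = ""
    t = (length($4) < length(a[key])) ? length($4) : length(a[key])
    for (x = 1; x <= t; x++) {
        if (substr($4, x, 1) != substr(a[key], x, 1)) break   # stop at the first mismatch
        word = word substr($4, x, 1)
    }
    a[key] = word
    next
}
{
    # First time this key is seen: store the full transcript name.
    a[key] = $4
}
END {
    # Note: for (x in a) visits keys in an unspecified order.
    for (x in a) print x, a[x]
}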
$ cat file
chrI 128980 129130 F53G12.5b
chrI 132280 132430 F53G12.5c.2
chrI 132280 132430 F53G12.5a
chrI 132280 132430 F53G12.5b
chrI 132280 132430 F53G12.5c.1
chrI 133600 133750 F53G12.5c.2
chrI 133600 133750 F53G12.5a
chrI 133600 133750 F53G12.5b
chrI 133600 133750 F53G12.5c.1
chrI 136240 136390 F53G12.4
chrI 139100 139250 F53G12.3
chrI 163220 163370 F56C11.2a
chrI 163220 163370 F56C11.2b
chrI 173900 174050 F56C11.6a
chrI 173900 174050 F56C11.6b
chrI 173900 174050 F56C11.6c
chrI 182240 182390 F56C11.3
chrI 184080 184230 Y48G1BL.2a
chrI 190720 190870 Y48G1BL.2a
$ awk -f script.awk file
chrI 173900 174050 F56C11.6
chrI 128980 129130 F53G12.5b
chrI 182240 182390 F56C11.3
chrI 139100 139250 F53G12.3
chrI 136240 136390 F53G12.4
chrI 132280 132430 F53G12.5
chrI 163220 163370 F56C11.2
chrI 184080 184230 Y48G1BL.2a
chrI 190720 190870 Y48G1BL.2a
chrI 133600 133750 F53G12.5
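The idea of keeping one running prefix per chr/start/end key and shrinking it as each line arrives can be sketched in Python as follows. The names shared_prefix and fold_prefixes and the demo lines are made up for illustration. The sketch stops comparing at the first mismatch, so characters that happen to match again later are not counted, and it tests key membership rather than truthiness so that an empty stored prefix still counts as seen.

```python
def shared_prefix(a, b):
    """Return the leading characters that a and b have in common."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break  # stop at the first mismatch; later matches must not count
        n += 1
    return a[:n]


def fold_prefixes(lines):
    """Keep one running prefix per (chr, start, end) key, updated line by line."""
    running = {}
    for line in lines:
        chrom, start, end, name = line.split()
        key = (chrom, start, end)
        running[key] = shared_prefix(running[key], name) if key in running else name
    return running


demo = ["chrI 1 2 AB.1x", "chrI 1 2 AC.1y", "chrI 5 6 AB.1x"]
for key, prefix in fold_prefixes(demo).items():
    print(*key, prefix)
```

Unlike awk's for (x in a), a Python dict iterates in insertion order, so the regions are printed in the order they first appear and no extra sort is needed for that.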
Answer 2 (score: 0)
Here is the result of last night's effort :-)
#!/bin/bash
sort file |
awk '
NR==1 { f123=$1" "$2" "$3; trans=$4; next }   # first line: start the first group
{                                             # subsequent lines
    if (f123 != $1" "$2" "$3) {               # fields 1-3 have changed: emit the finished group
        printf "%s %s\n", f123, trans
        f123=$1" "$2" "$3; trans=$4
    } else {                                  # fields 1-3 unchanged: keep the common prefix
        newtrans=$4
        x=length(newtrans)                    # length of the shorter of the two transcripts
        if (length(trans) < x) x=length(trans)
        common=""
        for (c=1; c<=x; c++) {                # copy characters until the first mismatch
            if (substr(trans,c,1) != substr(newtrans,c,1)) break
            common=common substr(trans,c,1)
        }
        trans=common
    }
}
END { if (NR) printf "%s %s\n", f123, trans } # emit the last group
'
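Some notes... Basically, the input file is sorted first so that records with the same chr/start/end values end up next to each other, and the result is piped into awk. When the first line is read, fields (columns) 1 to 3 are joined together and saved in the variable "f123". As each subsequent line is read, its first three columns are compared with the last three seen. If any part of them has changed, the previous region is output together with its transcript. If they have not changed, there is a new transcript to process: the prefix shared by the stored transcript and the current one is computed by copying letters until one differs, and the result is saved for output the next time columns 1 to 3 change. When the end of the input is reached, the region still held in memory is output.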
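The sort-then-stream logic described in these notes can be sketched in Python as below. The generator name stream_condense and the sample lines are assumptions made for illustration; the input must already be sorted so that equal keys are adjacent.

```python
def stream_condense(sorted_lines):
    """Yield one (key, prefix) pair per run of identical (chr, start, end) keys."""
    current_key = None
    prefix = ""
    for line in sorted_lines:
        fields = line.split()
        key, name = tuple(fields[:3]), fields[3]
        if key != current_key:
            if current_key is not None:
                yield current_key, prefix  # key changed: emit the finished group
            current_key, prefix = key, name
        else:
            # same key: shrink the prefix to what it shares with the new name
            n = 0
            while n < min(len(prefix), len(name)) and prefix[n] == name[n]:
                n += 1
            prefix = prefix[:n]
    if current_key is not None:
        yield current_key, prefix  # flush the last group, even if it has one line


lines = sorted([
    "chrI 173900 174050 F56C11.6a",
    "chrI 173900 174050 F56C11.6b",
    "chrI 184080 184230 Y48G1BL.2a",
    "chrI 190720 190870 Y48G1BL.2a",
])
for key, prefix in stream_condense(lines):
    print(*key, prefix)
```

The final flush is the step that is easiest to get wrong: the last group has to be written out with its own transcript whether or not it ever went through the common-prefix branch.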
Answer 3 (score: 0)
Without all the debugging, here is a simple awk one-liner:
awk -F"." '{ trimmed=substr($2,1,1);print $1"."trimmed;}' test.txt
chrI 128980 129130 F53G12.5
chrI 132280 132430 F53G12.5
chrI 132280 132430 F53G12.5
chrI 132280 132430 F53G12.5
chrI 132280 132430 F53G12.5
chrI 133600 133750 F53G12.5
chrI 133600 133750 F53G12.5
chrI 133600 133750 F53G12.5
chrI 133600 133750 F53G12.5
chrI 136240 136390 F53G12.4
chrI 139100 139250 F53G12.3
chrI 163220 163370 F56C11.2
chrI 163220 163370 F56C11.2
chrI 173900 174050 F56C11.6
chrI 173900 174050 F56C11.6
chrI 173900 174050 F56C11.6
chrI 182240 182390 F56C11.3
chrI 184080 184230 Y48G1BL.2
chrI 190720 190870 Y48G1BL.2
awk -F"." '{ trimmed=substr($2,1,1);print $1"."trimmed;}' test.txt |sort|uniq
chrI 128980 129130 F53G12.5
chrI 132280 132430 F53G12.5
chrI 133600 133750 F53G12.5
chrI 136240 136390 F53G12.4
chrI 139100 139250 F53G12.3
chrI 163220 163370 F56C11.2
chrI 173900 174050 F56C11.6
chrI 182240 182390 F56C11.3
chrI 184080 184230 Y48G1BL.2
chrI 190720 190870 Y48G1BL.2
Sorting by column: in -nrk, n means numeric, r means reverse, and k gives the column id; in this case I passed 2 and 3.
awk -F"." '{ trimmed=substr($2,1,1);print $1"."trimmed;}' test.txt |sort -nrk2,3|uniq
chrI 190720 190870 Y48G1BL.2
chrI 184080 184230 Y48G1BL.2
chrI 182240 182390 F56C11.3
chrI 173900 174050 F56C11.6
chrI 163220 163370 F56C11.2
chrI 139100 139250 F53G12.3
chrI 136240 136390 F53G12.4
chrI 133600 133750 F53G12.5
chrI 132280 132430 F53G12.5
chrI 128980 129130 F53G12.5
Update, working on the columns instead:
awk '{ trimmed=$4; if (match($4, /[0-9a-zA-Z]+\.[0-9]+/)) { trimmed=substr($4,RSTART,RLENGTH); } print $1"\t"$2"\t"$3"\t"trimmed; }' test.txt |sort|uniq
chrI 128980 129130 F53G12.5
chrI 132280 132430 F53G12.5
chrI 133600 133750 F53G12.5
chrI 136240 136390 F53G12.4
chrI 139100 139250 F53G12.3
chrI 163220 163370 F56C11.2
chrI 173900 174050 F56C11.6
chrI 182240 182390 F56C11.3
chrI 184080 184230 Y48G1BL.2
chrI 190720 190870 Y48G1BL.2
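A rough Python counterpart of this pattern-based trimming is sketched below. The regular expression (clone name, dot, gene number) is an assumption about how the transcript names are built, and the names trim_to_gene and dedupe_trimmed are hypothetical. The demo also shows where this approach differs from a true common prefix: a region with a single transcript loses its suffix too, whereas the question's second example keeps Y48G1C.10a for such a region.

```python
import re

# Assumed naming scheme: clone name, a dot, then the gene number.
GENE = re.compile(r"[0-9A-Za-z]+\.[0-9]+")


def trim_to_gene(name):
    """Return the leading 'clone.number' part of name, or name unchanged if no match."""
    m = GENE.match(name)
    return m.group(0) if m else name


def dedupe_trimmed(lines):
    """Trim column 4 of each line and drop repeated rows, keeping first-seen order."""
    seen = {}
    for line in lines:
        chrom, start, end, name = line.split()
        seen[(chrom, start, end, trim_to_gene(name))] = None
    return list(seen)


demo = [
    "chrI 83280 83430 Y48G1C.10a",
    "chrI 90420 90570 Y48G1C.10b",
    "chrI 90420 90570 Y48G1C.10c",
]
for row in dedupe_trimmed(demo):
    print(*row)
```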
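Answer 4 (score: 0)
I think Python's itertools.groupby and itertools.takewhile can handle the two parts of the problem: grouping the lines by the values in the first three columns, and trimming the fourth column to its common prefix.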
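import itertools
from operator import itemgetter

def combine(data):
    # groupby only merges adjacent rows, so rows sharing the first three columns must be consecutive
    for group, group_lines in itertools.groupby(data, itemgetter(0, 1, 2)):
        names = [line[3] for line in group_lines]
        # zip(*names) walks the names position by position; stop at the first position that differs
        prefix = "".join(t[0] for t in itertools.takewhile(lambda x: len(set(x)) == 1,
                                                           zip(*names)))
        yield group + (prefix,)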
Run it like this:
with open(filename) as f:
    for item in combine(line.split() for line in f):
        print("{:8}{:8}{:8}{}".format(*item))
Example run:
>>> data = """chrI 128980 129130 F53G12.5b
chrI 132280 132430 F53G12.5c.2
chrI 132280 132430 F53G12.5a
chrI 132280 132430 F53G12.5b
chrI 132280 132430 F53G12.5c.1
chrI 133600 133750 F53G12.5c.2
chrI 133600 133750 F53G12.5a
chrI 133600 133750 F53G12.5b
chrI 133600 133750 F53G12.5c.1
chrI 136240 136390 F53G12.4
chrI 139100 139250 F53G12.3
chrI 163220 163370 F56C11.2a
chrI 163220 163370 F56C11.2b
chrI 173900 174050 F56C11.6a
chrI 173900 174050 F56C11.6b
chrI 173900 174050 F56C11.6c
chrI 182240 182390 F56C11.3
chrI 184080 184230 Y48G1BL.2a
chrI 190720 190870 Y48G1BL.2a""".splitlines()
>>> for item in combine(line.split() for line in data):
...     print("{:8}{:8}{:8}{}".format(*item))
...
chrI 128980 129130 F53G12.5b
chrI 132280 132430 F53G12.5
chrI 133600 133750 F53G12.5
chrI 136240 136390 F53G12.4
chrI 139100 139250 F53G12.3
chrI 163220 163370 F56C11.2
chrI 173900 174050 F56C11.6
chrI 182240 182390 F56C11.3
chrI 184080 184230 Y48G1BL.2a
chrI 190720 190870 Y48G1BL.2a
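One caveat with itertools.groupby is that it only merges rows that sit next to each other. The small sketch below, using made-up sample rows and a hypothetical helper keys_seen, shows the same region being reported twice when the input is not ordered, and once after sorting on the first three columns.

```python
import itertools
from operator import itemgetter

rows = [line.split() for line in [
    "chrI 90420 90570 Y48G1C.10b",
    "chrI 83280 83430 Y48G1C.10a",
    "chrI 90420 90570 Y48G1C.10c",
]]


def keys_seen(data):
    """Return the group keys in the order groupby produces them."""
    return [key for key, _ in itertools.groupby(data, itemgetter(0, 1, 2))]


unsorted_keys = keys_seen(rows)
sorted_keys = keys_seen(sorted(rows, key=itemgetter(0, 1, 2)))
print(len(unsorted_keys), "groups without sorting,", len(sorted_keys), "after sorting")
```

The sort here compares the coordinates as strings, which is enough to bring equal keys together; converting start and end to int would be needed only if a numeric output order matters.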
Answer 5 (score: 0)
Using awk:
awk '{sub(/[^0-9]+/,"",$2);NF=2} !a[$1]++' FS=. OFS=. file
chrI 128980 129130 F53G12.5
chrI 132280 132430 F53G12.5
chrI 133600 133750 F53G12.5
chrI 136240 136390 F53G12.4
chrI 139100 139250 F53G12.3
chrI 163220 163370 F56C11.2
chrI 173900 174050 F56C11.6
chrI 182240 182390 F56C11.3
chrI 184080 184230 Y48G1BL.2
chrI 190720 190870 Y48G1BL.2
Answer 6 (score: 0)
You guys are awesome! Thanks for all the answers. Here is another solution, using Perl:
#!/usr/bin/env perl
use strict;
use warnings;

my $filename = $ARGV[0];
my %referenceTable;    # "chr SPACE start SPACE stop" => list of transcript names

open my $fh, '<', $filename or die "can not open file $filename: $!\n";
while (my $line = <$fh>) {
    chomp $line;    # otherwise the transcript name keeps its trailing newline
    # split ' ' splits on any whitespace, so tabs and spaces both work
    my ($chromosome, $start, $stop, $transcript) = split ' ', $line;
    next unless defined $transcript;    # skip blank or short lines
    my $key = $chromosome . "SPACE" . $start . "SPACE" . $stop;
    push @{ $referenceTable{$key} }, $transcript;
}
close $fh;

# $v is a reference to the array of transcript names stored under key $k
while (my ($k, $v) = each %referenceTable) {
    my ($chromosome, $start, $stop) = split /SPACE/, $k;
    print "chromosome start stop : $chromosome\t $start\t $stop \t";
    print "Common prefix : \t ";
    my $output = getleastcommonprefix(@{$v});
    print $output . "\n";
}

sub getleastcommonprefix {
    my @searcharray = @_;
    my $common = $searcharray[0];
    foreach my $name (@searcharray[1 .. $#searcharray]) {
        # Drop trailing characters until $common is a prefix of $name
        chop $common while index($name, $common) != 0;
    }
    return $common;
} ## end sub getleastcommonprefix