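Count the number of specific patterns in columns

Time: 2015-11-03 17:24:14

Tags: awk

Given the following file, I would like to count, for each column, how often each pattern whose two sides differ (A/T, but not A/A) occurs, i.e.: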

A/A C/G C/G
A/T C/C G/G
A/A C/G C/C
A/T C/G C/G
T/T C/G C/G

Output:

A/T = 2/5
C/G = 4/5
C/G = 3/5

I have tried some code in AWK, but it doesn't seem to work. I would appreciate any help, thanks!

Edit:

I recreated my file as follows:

A A C G C G
A T C C G G
A A C G C C
A T C G C G
T T C G C G

awk '$1 != $2 {n++}; END {print n}' file

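This gives me the number of mismatches between the first two columns. I now want to loop over the columns and check whether each pair of adjacent columns is equal, i.e. 1 versus 2, 3 versus 4, and so on.

How can I implement a loop over the odd columns?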

4 answers:

Answer 0: (score: 1)

I would do it like this:

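from collections import Counter

# Read every non-empty row and split it into its whitespace-separated columns
with open('file.txt', 'r') as raw_data:
    data = [line.split() for line in raw_data if line.strip()]

# Collect the values of each of the three columns
a = [record[0] for record in data]
b = [record[1] for record in data]
c = [record[2] for record in data]

# Print how often each pattern occurs in each column
print(Counter(a))
print(Counter(b))
print(Counter(c))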

This prints the counts as dictionary-like Counter objects, but you can process them from there, right?
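To get from per-column counts to the fractions requested in the question, one possible sketch is shown below. It assumes the slash-separated layout from the original file; the helper name `mismatch_frequencies` is made up for illustration and is not part of the answer above.

```python
from collections import Counter

def mismatch_frequencies(lines):
    """For each column, map every pattern whose two sides differ
    (e.g. 'A/T') to a (count, total_rows) pair."""
    rows = [line.split() for line in lines if line.strip()]
    total = len(rows)
    result = []
    # zip(*rows) transposes the rows into columns
    for column in zip(*rows):
        counts = Counter(
            pattern for pattern in column
            if len(set(pattern.split("/"))) > 1  # keep A/T, drop A/A
        )
        result.append({p: (n, total) for p, n in counts.items()})
    return result

sample = """A/A C/G C/G
A/T C/C G/G
A/A C/G C/C
A/T C/G C/G
T/T C/G C/G""".splitlines()

for frequencies in mismatch_frequencies(sample):
    for pattern, (count, total) in frequencies.items():
        print(f"{pattern} = {count}/{total}")
```

On the sample data this prints A/T = 2/5, C/G = 4/5 and C/G = 3/5. Note that it assumes every row has the same number of columns, since `zip` silently truncates to the shortest row.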

Answer 1: (score: 0)

This may help, although there may well be better ways to do it:

text = """A/A C/G C/G
A/T C/C G/G
A/A C/G C/C
A/T C/G C/G
T/T C/G C/G"""

first_column = []
second_column = []
third_column = []

# Split each row into its three columns
for row in text.strip().split('\n'):
    columns = row.split()
    first_column.append(columns[0])
    second_column.append(columns[1])
    third_column.append(columns[2])

# Map each pattern to a "count/total" string for its column
first_column_occurrences = dict((i, "{}/{}".format(first_column.count(i), len(first_column))) for i in first_column)
second_column_occurrences = dict((i, "{}/{}".format(second_column.count(i), len(second_column))) for i in second_column)
third_column_occurrences = dict((i, "{}/{}".format(third_column.count(i), len(third_column))) for i in third_column)

print("First column:")
print("-------------")
for k, v in first_column_occurrences.items():
    print("{} = {}".format(k, v))

print("\nSecond column:")
print("-------------")

for k, v in second_column_occurrences.items():
    print("{} = {}".format(k, v))

print("\nThird column:")
print("-------------")

for k, v in third_column_occurrences.items():
    print("{} = {}".format(k, v))

Answer 2: (score: 0)

awk to the rescue!

This works for any even number of columns.

awk '{for(i=1;i<=NF;i+=2) 
         if($i!=$(i+1)) 
             a["column "i": "$i"/"$(i+1)]++} 
  END{for(k in a) print k,a[k]"/"NR}' file

column 1: A/T 2/5
column 3: C/G 4/5
column 5: C/G 3/5
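For comparison, the same stride-of-two loop can be sketched in Python. This is only an illustration of the technique, assuming the space-separated layout of the recreated file; the function name `paired_mismatches` is hypothetical.

```python
from collections import Counter

def paired_mismatches(lines):
    """Compare fields 1-2, 3-4, ... of every row and count the pairs
    that differ, keyed by (1-based column number, "X/Y")."""
    counts = Counter()
    n_rows = 0
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # unlike awk's NR, blank lines are not counted here
        n_rows += 1
        # range(0, ..., 2) plays the role of awk's for(i=1; i<=NF; i+=2)
        for i in range(0, len(fields) - 1, 2):
            left, right = fields[i], fields[i + 1]
            if left != right:
                counts[(i + 1, f"{left}/{right}")] += 1
    return counts, n_rows

sample = """A A C G C G
A T C C G G
A A C G C C
A T C G C G
T T C G C G""".splitlines()

counts, n_rows = paired_mismatches(sample)
for (column, pattern), count in sorted(counts.items()):
    print(f"column {column}: {pattern} {count}/{n_rows}")
```

The keys are sorted before printing because, as with `for (k in a)` in awk, the iteration order of the collected counts should not be relied on for the final report. In both versions C/G and G/C would be counted as two separate patterns.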

Answer 3: (score: 0)

You don't need to store the rows in memory at all, and you can also use the csv lib for parsing:

from collections import Counter
import csv

with open('file.txt', 'r') as raw_data:
    # One counter per column, updated row by row while streaming the file
    cn_a, cn_b, cn_c = Counter(), Counter(), Counter()
    for a, b, c in csv.reader(raw_data, delimiter=" "):
        cn_a[a] += 1
        cn_b[b] += 1
        cn_c[c] += 1
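If the number of columns is not fixed at three, the same streaming idea might be generalized as in the sketch below. It assumes a single-space delimiter as above; `stream_column_counts` is an illustrative name, and `io.StringIO` merely stands in for the open file.

```python
import csv
import io
from collections import Counter

def stream_column_counts(handle, delimiter=" "):
    """Count the values of every column while reading row by row,
    without keeping the rows themselves in memory."""
    counters = []
    for row in csv.reader(handle, delimiter=delimiter):
        if not row:
            continue  # csv.reader yields an empty list for blank lines
        # Grow the list of counters the first time a wider row is seen
        while len(counters) < len(row):
            counters.append(Counter())
        for counter, value in zip(counters, row):
            counter[value] += 1
    return counters

sample = io.StringIO(
    "A/A C/G C/G\n"
    "A/T C/C G/G\n"
    "A/A C/G C/C\n"
    "A/T C/G C/G\n"
    "T/T C/G C/G\n"
)

counters = stream_column_counts(sample)
for index, counter in enumerate(counters, start=1):
    print(index, dict(counter))
```

Only the counters are held in memory, so the cost depends on the number of distinct patterns per column rather than on the number of rows.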