Question

I have a large file with 7 columns, and I want to compare 2 of them, col 1 and col 7:

 chr_locations(col 1)        gene_name(col 7)
chr1:66997989-67000678        geneA
chr1:66997824-67000456        geneA
chr2:33544389-33548489        geneB
chr2:33546285-33547055        geneB
chr2:44567890-44568980        geneB

I want to count how many chromosome locations occur for a given gene:

chr1:66997989-67000678    geneA     2
chr1:66997824-67000456    geneA     2
chr2:33544389-33548489    geneB     3
chr2:33546285-33547055    geneB     3
chr2:44567890-44568980    geneB     3

I'm sure there is a simpler way to do this in awk than writing a script in Python. Can any of you help? Thanks.

Answer 1

You need an array to hold the counts, with array keys built from the 2 columns.
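
As a rough illustration of that idea, here is a minimal Python sketch of a counter keyed on both columns, next to one keyed on the gene name alone. It assumes the rows have already been reduced to (col 1, col 7) pairs; the names `rows`, `pair_counts`, and `gene_counts` are made up for this example and are not part of the answer.

```python
from collections import Counter

# Sample (col 1, col 7) pairs copied from the question.
rows = [
    ("chr1:66997989-67000678", "geneA"),
    ("chr1:66997824-67000456", "geneA"),
    ("chr2:33544389-33548489", "geneB"),
    ("chr2:33546285-33547055", "geneB"),
    ("chr2:44567890-44568980", "geneB"),
]

# Key built from both columns: one counter slot per (location, gene) pair.
pair_counts = Counter(rows)

# Key built from the gene alone: one counter slot per gene.
gene_counts = Counter(gene for _, gene in rows)

# Print each row with both counts side by side for comparison.
for location, gene in rows:
    print(location, gene, pair_counts[(location, gene)], gene_counts[gene])
```

With the sample data every (location, gene) pair is distinct, so the two-column key yields 1 for each row, while the gene-only key yields the 2 and 3 shown in the desired output. Which key is appropriate depends on whether the real file repeats locations.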

If you want us to test our answers, you need to provide some real data.

Answer 2

It is easy in either language (in any language, really). It all depends on what you know.

AWK

awk '{
    count[$7]++; 
    memory_1[NR] = $1; 
    memory_7[NR] = $7;
} 
END{
    for(i=1; i<=NR; ++i) print memory_1[i] OFS memory_7[i] OFS count[memory_7[i]]
}' file

Python

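from collections import Counter

# Read all rows, count rows per gene (column 7), then print each row with its gene's count.
with open("file") as handle:
    records = [line.split() for line in handle]
count = Counter(r[6] for r in records)
print("\n".join("\t".join((r[0], r[6], str(count[r[6]]))) for r in records))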

You get:

chr1:66997989-67000678  geneA   2
chr1:66997824-67000456  geneA   2
chr2:33544389-33548489  geneB   3
chr2:33546285-33547055  geneB   3
chr2:44567890-44568980  geneB   3
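
Both versions above keep every row in memory until the end of the input, which may matter for the large file mentioned in the question. A possible alternative, sketched here under the assumption that the input is a regular file that can be read twice, is to count genes in a first pass and print in a second. The function name `annotate_two_pass` and its parameters are hypothetical.

```python
from collections import Counter


def annotate_two_pass(path, loc_col=0, gene_col=6):
    """Yield (location, gene, count) tuples, holding only per-gene counts in memory."""
    needed = max(loc_col, gene_col)

    # Pass 1: count how many rows carry each gene name.
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) > needed:  # skip blank or short lines
                counts[fields[gene_col]] += 1

    # Pass 2: re-read the file and attach the count to each row.
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) > needed:
                gene = fields[gene_col]
                yield fields[loc_col], gene, counts[gene]
```

A caller could then write `for loc, gene, n in annotate_two_pass("file"): print(loc, gene, n, sep="\t")`. The trade-off is reading the file twice in exchange for memory that grows with the number of distinct genes rather than the number of rows.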
