Question

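I have two files. The first is a large file of gene variants with multiple tab-separated columns. The gene-name column can hold a single name or several names separated by commas (the gene names in this example are SAMD11 and NOC2L):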

1   874816  874816  -   T   rs200996316 SAMD11  exonic  ENSG00000187634 frameshift insertion
1   878331  878331  C   T   rs148327885 SAMD11  exonic  ENSG00000187634 nonsynonymous SNV
1   879676  879676  G   A   rs6605067   NOC2L,SAMD11    UTR3    ENSG00000187634,ENSG00000188976
1   879687  879687  T   C   rs2839  NOC2L,SAMD11    UTR3    ENSG00000187634,ENSG00000188976

The second file is a single-column list of gene names, for example:

SAMD11
NOC2L

I want to match the gene names in the second file against the gene names in the first file. I am currently using awk, comparing the whole gene-name column against the list.

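However, this prints only exact matches, so the lines with NOC2L,SAMD11 are not printed. For the example above, the expected output is the first four lines of the first file:

1   874816  874816  -   T   rs200996316 SAMD11  exonic  ENSG00000187634 frameshift insertion
1   878331  878331  C   T   rs148327885 SAMD11  exonic  ENSG00000187634 nonsynonymous SNV
1   879676  879676  G   A   rs6605067   NOC2L,SAMD11    UTR3    ENSG00000187634,ENSG00000188976
1   879687  879687  T   C   rs2839  NOC2L,SAMD11    UTR3    ENSG00000187634,ENSG00000188976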

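I still want exact matching, because some gene names are similar. For example, there might be a gene called SAMD1, and a fuzzy match would then return SAMD1, SAMD11, and so on. So I need something that matches exactly but ignores the commas in the gene-name column, or treats them as field separators or something similar.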

Thanks in advance.

Answer 1

$ cat tst.awk
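NR==FNR { genes[$0]; next }    # first file: store each gene name as an array key
{
    split($7,a,/,/)            # second file: split the gene column on commas
    for (i in a) {
        if (a[i] in genes) {   # exact lookup, so SAMD1 does not match SAMD11
            print
            next               # print each line at most once
        }
    }
}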

$ awk -f tst.awk secondfile.txt firstfile.txt
1   874816  874816  -   T   rs200996316 SAMD11  exonic  ENSG00000187634 frameshift insertion
1   878331  878331  C   T   rs148327885 SAMD11  exonic  ENSG00000187634 nonsynonymous SNV
1   879676  879676  G   A   rs6605067   NOC2L,SAMD11    UTR3    ENSG00000187634,ENSG00000188976
1   879687  879687  T   C   rs2839  NOC2L,SAMD11    UTR3    ENSG00000187634,ENSG00000188976
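The script above relies on awk's default whitespace splitting. As a hedged variant, assuming the real file is strictly tab-separated (as the question describes) and that the gene names sit in column 7, setting the field separator to a tab keeps the column position stable even when another column is empty or contains spaces. The file names, the `col` variable, and the last two sample rows (`id3`, `id4`) are made up for illustration only.

```shell
#!/bin/sh
# Sketch: same split-and-lookup idea, with an explicit tab field separator.
# File names and the rows tagged id3/id4 are hypothetical test data.
tmp=$(mktemp -d)

printf 'SAMD11\nNOC2L\n' > "$tmp/genes.txt"

{
    printf '1\t874816\t874816\t-\tT\trs200996316\tSAMD11\texonic\n'
    printf '1\t879676\t879676\tG\tA\trs6605067\tNOC2L,SAMD11\tUTR3\n'
    # Empty 4th column: default whitespace splitting would shift the gene column.
    printf '1\t900000\t900000\t\tA\tid3\tNOC2L\tUTR3\n'
    # SAMD1 only resembles a listed gene and must not be printed.
    printf '1\t900001\t900001\tC\tT\tid4\tSAMD1\texonic\n'
} > "$tmp/variants.tsv"

out=$(awk -F'\t' -v col=7 '
    NR == FNR { genes[$0]; next }       # first file: remember each gene name
    {
        n = split($col, names, ",")     # second file: split the gene column on commas
        for (i = 1; i <= n; i++)
            if (names[i] in genes) { print; next }
    }
' "$tmp/genes.txt" "$tmp/variants.tsv")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Passing the column number with `-v col=7` is only a convenience so the same script can be pointed at a different column without editing it.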

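This also works, although it loops over every listed gene for each line and treats each gene name as a regular expression, so it is slower on large lists and a name containing a regex metacharacter such as "." could match more than intended: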

$ cat tst.awk
NR==FNR { genes[$0]; next }
{
    for (gene in genes) {
        if ($7 ~ "(^|,)"gene"(,|$)") {
            print
            next
        }
    }
}
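To check the concern raised in the question about similarly named genes, the following sketch compares an unanchored regex match with the comma-anchored pattern used in the script above. The three input values and the `gene` variable are made up for illustration; only the pattern itself comes from the answer.

```shell
#!/bin/sh
# Sketch: unanchored vs. comma-anchored matching for the gene name SAMD1.
# Each input line stands in for the contents of the gene-name column.
fields='SAMD11
NOC2L,SAMD11
SAMD1,NOC2L'

# Unanchored: SAMD1 is a substring of SAMD11, so every line matches.
loose=$(printf '%s\n' "$fields" | awk -v gene=SAMD1 '$1 ~ gene')

# Anchored: the name must be bounded by the field start/end or a comma.
strict=$(printf '%s\n' "$fields" | awk -v gene=SAMD1 '$1 ~ "(^|,)" gene "(,|$)"')

printf 'loose:\n%s\n' "$loose"
printf 'strict:\n%s\n' "$strict"
```

With these inputs the unanchored match keeps all three lines, while the anchored pattern keeps only the line that actually lists SAMD1.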

Using awk to match a column from one file to a column of another when the second file's column contains commas
