Match a specific pattern and print only the matched string on the previous line

Time: 2019-08-01 11:59:01

Tags: awk pattern-matching fastq

I have updated the question with additional information.

I have a .fastq file in the following format:

@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 (sequence name)
CATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC.. (sequence)
+
ACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFF.. (sequence quality)

The format is the same for every sequence (4 lines, repeating). What I want to do is search a window of n = 35 characters on the second line for a specific regular-expression pattern ([A-Z]{5,}ACA[A-Z]{5,}ACA[A-Z]{5,}), cut it out if found, and report it at the end of the previous line.

So far I have written a bit of code that almost does what I want. My script.awk is below:

match(substr($0,0,35),/regexp/,a) {
    print p,a[0]   # print the previous line followed by the matched string
    print          # print the current line
    for(i=0;i<=1;i++) {   # print the 2 following lines
        getline
        print
    }
}
# store the previous line
{ p = $0 }

Starting from a file like the one above, I would like the matched string cut from the sequence and appended to the end of the sequence-name line.

2 answers:

Answer 0 (score: 0)

I warn you, I wanted to have some fun with this, and the result is twisted.

awk -v pattern=pattern -v window=15 '
BEGIN{RS="@";FS=OFS="\n"}
{pos = match($2, pattern); n_del=pos+length(pattern)}
pos && (n_del<=window){$1 = $1 " " pattern; $2=substr($2, n_del); $4=substr($4, n_del)}
NR!=1{printf "%s%s", RS, $0}
' file

Input:

@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACGCpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGDGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..

Output:

@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 pattern
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC.. 
+ 
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 
CATCTACGCpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC.. 
+ 
ACCCGGGGDGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..

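The second block is not updated because the window is 15 and the pattern cannot be found within that window.

I use the variable RS so that each whole 4-line block is handled as $0, with its lines available as $1, $2, $3 and $4. Because the input file starts with RS rather than ending with it, I preferred not to set ORS and to use printf instead of print.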
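
To make the RS trick concrete, here is a minimal sketch on a tiny made-up two-read input (the read names r1/r2 and the function name show_fields are purely illustrative, not from the original post). With RS="@" and FS="\n", each 4-line block becomes one record whose lines are $1 to $4, and the empty record before the first "@" is skipped with NR > 1. One caveat worth keeping in mind: "@" is also a legal quality character in Phred+33 FASTQ, so a quality line containing "@" would split a record in the wrong place with this approach.

```shell
# Sketch only: r1/r2 and show_fields are made-up names for illustration.
show_fields() {
  printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nJJJJ\n' |
  awk 'BEGIN { RS = "@"; FS = "\n" }          # one record per 4-line block
       NR > 1 { print "name=" $1, "seq=" $2, "qual=" $4 }'   # skip empty first record
}
show_fields
```

Each read should come out on a single line with its name, sequence and quality labelled, which shows that the four physical lines really are fields of one record.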

Answer 1 (score: 0)

$ cat tst.awk
BEGIN {
    tgtStr   = "pattern"
    tgtLgth  = length(tgtStr)
    winLgth  = 35
    numLines = 4
}
{
    lineNr = ( (NR-1) % numLines ) + 1
    rec[lineNr] = $0
}
lineNr == numLines {
    if ( idx = index(substr(rec[2],1,winLgth),tgtStr) ) {
        rec[1] = rec[1] " " tgtStr
        rec[2] = substr(rec[2],idx+tgtLgth)
        rec[4] = substr(rec[4],idx+tgtLgth)
    }
    for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
        print rec[lineNr]
    }
}

$ awk -f tst.awk file
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8  pattern
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..

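Regarding the code you posted:

  • substr($0,0,35): strings, fields, line numbers and arrays in awk start at 1, not 0, so this should be substr($0,1,35). In this case awk will compensate for your mistake and treat it as if you had written 1 instead of 0, but get into the habit of starting everything at 1 to avoid mistakes when it matters.
  • for(i=0;i<=1;i++): should be for(i=1;i<=2;i++) for the same reason.
  • getline: not an appropriate use here, and syntactically fragile.

Update: based on your comment below, pattern is actually a regular expression rather than a string: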

$ cat tst.awk
BEGIN {
    tgtRegexp = "[A-Z]{5,}ACA[A-Z]{5,}ACA[A-Z]{5,}"
    winLgth   = 35
    numLines  = 4
}
{
    lineNr = ( (NR-1) % numLines ) + 1
    rec[lineNr] = $0
}
lineNr == numLines {
    if ( match(substr(rec[2],1,winLgth),tgtRegexp) ) {
        rec[1] = rec[1] " " substr(rec[2],RSTART,RLENGTH)
        rec[2] = substr(rec[2],RSTART+RLENGTH)
        rec[4] = substr(rec[4],RSTART+RLENGTH)
    }
    for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
        print rec[lineNr]
    }
}
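
As a worked example of the match()/RSTART/RLENGTH step, here is a minimal sketch on a single made-up read. The regexp ACA[A-Z]+ACA and the 12-character window are simplified stand-ins chosen for illustration (interval expressions such as {5,} are not supported by every awk, for example some older mawk releases), and trim_demo is a hypothetical name.

```shell
# Sketch only: the read, the regexp and the window size are illustrative assumptions.
trim_demo() {
  printf '@read1\nGGACATTTACACCCC\n+\nABCDEFGHIJKLMNO\n' |
  awk '
    { rec[NR] = $0 }                      # buffer the 4 lines of the record
    END {
      # search only the first 12 characters of the sequence line
      if ( match(substr(rec[2],1,12), /ACA[A-Z]+ACA/) ) {
        rec[1] = rec[1] " " substr(rec[2],RSTART,RLENGTH)   # append the match to the name
        rec[4] = substr(rec[4],RSTART+RLENGTH)              # trim quality to the same offset
        rec[2] = substr(rec[2],RSTART+RLENGTH)              # drop everything up to the match end
      }
      for ( i=1; i<=4; i++ ) print rec[i]
    }'
}
trim_demo
```

Because the window starts at character 1, RSTART and RLENGTH computed on the window can be applied directly to the full sequence and quality lines, which keeps the two lines the same length after trimming.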