I have updated the question with additional information.
I have a .fastq file; a sample record is shown further below.
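The format is the same for every sequence (4 repeating lines). What I want to do is search a window of n = 35 characters in the second line for a specific regular expression pattern ([A-Z]{5,}ACA[A-Z]{5,}ACA[A-Z]{5,}), cut it out if found, and report it at the end of the previous line.
So far I have written some code that almost does what I want; my script.awk is shown further below.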
Starting from a file like this:
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 (sequence name)
CATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC.. (sequence)
+
ACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFF.. (sequence quality)
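I would like the matched text to be removed from the sequence and appended to the sequence-name line. My current script.awk is: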
match(substr($0,0,35),/regexp/,a) {
print p,a[0] #print the previous line respect to the matched one
print #print the current line
for(i=0;i<=1;i++) { # print the 2 lines following
getline
print
}
}
# store the previous line
{ p = $0 }
Answer 0 (score: 0)
Fair warning: I wanted to have some fun with this, so it is a bit twisted.
awk -v pattern=pattern -v window=15 '
BEGIN{RS="@";FS=OFS="\n"}
{pos = match($2, pattern); n_del=pos+length(pattern)}
pos && (n_del<=window){$1 = $1 " " pattern; $2=substr($2, n_del); $4=substr($4, n_del)}
NR!=1{printf "%s%s", RS, $0}
' file
Input:
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACGCpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGDGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
Output:
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 pattern
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACGCpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGDGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
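The second block is not updated because the window is 15 and the pattern cannot be found within that window.
I set the variable RS so that each 4-line block is handled as a single record: $0 is the whole block, and $1, $2, $3 and $4 are its four lines. Because the input file starts with RS rather than ending with it, I prefer not to set ORS and to use printf instead of print.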
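To make the record handling above concrete, here is a minimal sketch of how RS="@" and FS="\n" expose one 4-line block per record. The function name show_records and the two sample reads are made up for illustration. One caveat, which is an observation rather than part of the answer: "@" is also a legal quality character in Phred+33 encoded files, so splitting on it may misparse real data.

```shell
# Hypothetical helper: print name, sequence and quality of each record.
show_records() {
  awk 'BEGIN { RS = "@"; FS = "\n" }
       # The text before the first "@" is an empty record, so skip NR == 1.
       NR > 1 { printf "%s|%s|%s\n", $1, $2, $4 }'
}

# Made-up two-record sample, not taken from the real file.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nJJJJ\n' | show_records
```

Each output line corresponds to one 4-line block, which is why the answer can edit $1, $2 and $4 of the same record in a single rule.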
Answer 1 (score: 0)
$ cat tst.awk
BEGIN {
tgtStr = "pattern"
tgtLgth = length(tgtStr)
winLgth = 35
numLines = 4
}
{
lineNr = ( (NR-1) % numLines ) + 1
rec[lineNr] = $0
}
lineNr == numLines {
if ( idx = index(substr(rec[2],1,winLgth),tgtStr) ) {
rec[1] = rec[1] " " tgtStr
rec[2] = substr(rec[2],idx+tgtLgth)
rec[4] = substr(rec[4],idx+tgtLgth)
}
for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
print rec[lineNr]
}
}
$ awk -f tst.awk file
@M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 pattern
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
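The bookkeeping in tst.awk depends on mapping the running line number NR onto a slot from 1 to 4 within each record. A small sketch of that arithmetic, assuming 4-line records as in the script; the function name line_slots and the sample input are hypothetical:

```shell
# Hypothetical helper: show which slot (1..4) each input line falls into.
line_slots() {
  awk -v numLines=4 '{ print NR, ((NR - 1) % numLines) + 1 }'
}

# Six arbitrary lines: the slot number wraps back to 1 on line 5.
printf '%s\n' a b c d e f | line_slots
```

Because the slot reaches numLines exactly on the last line of each record, the script can wait until then to edit and print all four buffered lines together.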
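Regarding the code you posted:
substr($0,0,35) - strings, fields, line numbers and arrays in awk start at 1, not 0, so that should be substr($0,1,35). In this case awk tolerates the mistake instead of failing (some implementations treat the 0 as 1, while others, such as gawk, count from position 0 and so return one character fewer), but get into the habit of starting everything at 1 to avoid errors when it matters.
for(i=0;i<=1;i++) - should be for(i=1;i<=2;i++) for the same reason.
getline - not an appropriate use here, and syntactically fragile.
Update - based on your comment below, pattern is actually a regular expression rather than a string: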
$ cat tst.awk
BEGIN {
tgtRegexp = "[A-Z]{5,}ACA[A-Z]{5,}ACA[A-Z]{5,}"
winLgth = 35
numLines = 4
}
{
lineNr = ( (NR-1) % numLines ) + 1
rec[lineNr] = $0
}
lineNr == numLines {
if ( match(substr(rec[2],1,winLgth),tgtRegexp) ) {
rec[1] = rec[1] " " substr(rec[2],RSTART,RLENGTH)
rec[2] = substr(rec[2],RSTART+RLENGTH)
rec[4] = substr(rec[4],RSTART+RLENGTH)
}
for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
print rec[lineNr]
}
}
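The regexp version relies on match() setting RSTART and RLENGTH for the windowed substring. The sketch below tries that on a short made-up sequence; the function name trim_match, the simplified regexp and the window sizes are assumptions chosen for illustration, and "+" is used instead of an interval expression because some older awks do not support {n,}.

```shell
# Hypothetical helper: print "<match> <rest>" if the regexp ($1) is found
# within the first $2 characters of the line, else "- <line>".
trim_match() {
  awk -v re="$1" -v win="$2" '{
    if ( match(substr($0, 1, win), re) ) {
      # RSTART/RLENGTH refer to the window, which starts at position 1,
      # so they index the full line as well.
      print substr($0, RSTART, RLENGTH), substr($0, RSTART + RLENGTH)
    } else {
      print "-", $0
    }
  }'
}

echo 'GGACATTACAGGGTTT' | trim_match 'ACA[A-Z]+ACA' 10   # match fits the window
echo 'GGACATTACAGGGTTT' | trim_match 'ACA[A-Z]+ACA' 9    # window one char too short
```

With a window of 10 the match ends exactly at the window edge and is cut out; with a window of 9 the line is left unchanged, the same behaviour described for the second block in the first answer.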