在下面awk
我试图提取并比较$4
中与p.
成星的每个子字符串。如果前三个字母与后三个字母相同(中间有一个数字),则p.
更新为p.(3 letters)(digit)(=)
--- ()
仅显示那里是3个肠,不需要。如果3 letters
不同,则该行不变。在示例中的下面的file
行1中。在我的实际数据中,大约有10,000行,大约有50列,但$4
是唯一一个在ut中有这些值的行,即te p.
p.
的格式将总是三个字母后跟一个1-4位#然后再输三个字母。我认为下面的awk
尝试会提取每个p.
并在;
上拆分,但我不确定如何比较以检查三个字母是否相同。谢谢你:)。
档案 tab-delimited
Chr Start ExonicFunc.refGene AAChange.refGene
chr1 155880573 synonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu110Glu;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110Glu
chr1 155880573 nonsynonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln
所需的输出 tab-delimited
Chr Start ExonicFunc.refGene AAChange.refGene
chr1 155880573 synonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu110=;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110=
chr1 155880573 nonsynonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln
AWK
awk '
BEGIN { OFS="\t" }
$4 ~ /:NM/ {
ostring=""
# split $4 by ";" and cycle through them
nNM=split($4,NM,";")
for (n=1; n<=nNM; n++) {
if (n>1) ostring=(ostring ";") # append ";"
if (match(NM[n],/p[.].*/)) {
# copy up to "p."
ostring=(ostring substr(NM[n],1,RSTART+1))
# Get the substring after "p."
VAL=substr(NM[n],RSTART+2)
# Get its length
lenVAL=length(VAL)
# store aa array
aa=[{while(length($4)=3){print substr($044,1,3);gsub(/^./,"")}]}' file
答案 0 :(得分:1)
扩展GNU awk
解决方案:
awk 'NR==1; NR > 1{
len = split($4, a, /\<p\.[a-zA-Z]{3}[0-9]+[a-zA-Z]{3}\>/, seps);
if (len == 1){ print; next }
res = ""
for (i=1; i < len; i++) {
s = seps[i];
if (substr(s, 3, 3) == substr(s, length(s) - 2)) {
seps[i] = substr(s, 1, length(s) - 3)"=";
}
}
for (i=1; i <= len; i++)
res = res a[i] (seps[i]? seps[i]:"");
$4 = res; print
}' FS='\t' OFS='\t' file
输出:
Chr Start ExonicFunc.refGene AAChange.refGene
chr1 155880573 synonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu110=;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110=
chr1 155880573 nonsynonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln
时间表现测量:
输入testfile
:
$ wc -l testfile
10000 testfile
time(awk 'NR==1; NR > 1{
len = split($4, a, /\<p\.[a-zA-Z]{3}[0-9]+[a-zA-Z]{3}\>/, seps);
if (len == 1){ print; next }
res = ""
for (i=1; i < len; i++) {
s = seps[i];
if (substr(s, 3, 3) == substr(s, length(s) - 2)) {
seps[i] = substr(s, 1, length(s) - 3)"=";
}
}
for (i=1; i <= len; i++)
res = res a[i] (seps[i]? seps[i]:"");
$4 = res; print
}' FS='\t' OFS='\t' testfile >/dev/null)
real 0m0.269s
user 0m0.256s
sys 0m0.000s
time(awk 'BEGIN { FS=OFS="\t" }
NR>1 {
head = ""
tail = $4
while ( match(tail,/(p\.([[:alpha:]]{3})[0-9]+)([[:alpha:]]{3})/,a) ) {
head = head substr(tail,1,RSTART-1) a[1] (a[2] == a[3] ? "=" : a[3])
tail = substr(tail,RSTART+RLENGTH)
}
$4 = head tail
}
{ print }' testfile >/dev/null)
real 0m0.470s
user 0m0.416s
sys 0m0.008s
答案 1 :(得分:1)
使用GNU awk为第3个arg匹配():
Optional<Integer> input = Optional.of(42);
Optional<String> result = transformNullSafe(
input,
new Function<Integer, String>() {
public String apply(Integer i) {
return methodThatMighReturnNull(true, i);
}
});