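Thanks to @Jose Ricardo Bustos M., whose help with `file1` and `file2` led to the script below.
However, I can't seem to match `BRCA2` from `file1` against the `BRCA 1, BRCA2` entry in `file2` (line 2, skipping the header). I'm not sure whether that is because `BRCA2` is the second item, after the `,`, or whether the problem is `$7`: it contains `full gene sequence and full deletion/duplication analysis`, so `full gene sequence` is only a partial match of the whole `$7` field. Thank you :).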
file1
BRCA2
BCR
SCN1A
fbn1
file2
Tier explanation . List code gene gene name methodology disease
Tier 1 . . 811 DMD dystrophin deletion analysis and duplication analysis, if performed Publication Date: January 1, 2014 Duchenne/Becker muscular dystrophy
Tier 1 . Jan-16 81 BRCA 1, BRCA2 breast cancer 1 and 2 full gene sequence and full deletion/duplication analysis hereditary breast and ovarian cancer
Tier 1 . Jan-16 70 ABL1 ABL1 gene analysis variants in the kinse domane acquired imatinib tyrosine kinase inhibitor
Tier 1 . . 806 BCR/ABL 1 t(9;22) major breakpoint, qualitative or quantitative chronic myelogenous leukemia CML
Tier 1 . Jan-16 85 FBN1 Fibrillin full gene sequencing heart disease
Tier 1 . Jan-16 95 FBN1 fibrillin del/dup heart disease
AWK
awk 'BEGIN{FS=OFS="\t"} # define input and output field separators
{$0=toupper($0)} # convert every line of both files to uppercase
{$5=toupper($5)} # convert $5 in `file2` to uppercase
{$7=toupper($7)} # convert $7 in `file2` to uppercase
FNR==NR{ # process each line of `file2` (the first file on the command line)
if(NR>1 && ($7 ~ /FULL GENE SEQUENC/)) { # skip header and check for full gene sequenc or full gene sequencing, using `regexp`
gsub(" ","",$5) #removing white space
n=split($5,v,"/")
d[v[1]] = $4 #from split, first element as key
}
next
}{print $1, ($1 in d?d[$1]:279)}' file2 file1 # print name then default if no match
Current output
BRCA2 279
BCR 279
SCN1A 279
FBN1 85
Desired output
BRCA2 81 --- match in line 2 of $5 in file 2, BRCA 1, BRCA2 and $7 has full gene sequence
BCR 279
SCN1A 279
FBN1 85
Answer 0 (score: 2)
The problem is in the following part of the code:
gsub(" ","",$5)
n=split($5,v,"/")
d[v[1]] = $4
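AFAIK, it works fine for the case `BCR/ABL 1`, but when you use it for `BRCA 1, BRCA2` it will NOT produce the result you expect. Removing the spaces from `BRCA 1, BRCA2` gives `BRCA1,BRCA2`, and splitting that on `/` returns the same string `BRCA1,BRCA2` unchanged, because the delimiter is wrong.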
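This is easy to check in isolation. A minimal sketch that runs the same `gsub` and `split` calls on just the gene field (the `printf` input stands in for `$5` of the `BRCA 1, BRCA2` row, and `out` is only a name used for this illustration):

```shell
out=$(printf 'BRCA 1, BRCA2\n' | awk '{
    gsub(" ", "", $0)        # remove spaces: BRCA1,BRCA2
    n = split($0, v, "/")    # the string has no "/", so nothing is split
    print n, v[1]            # number of pieces, then the first piece
}')
echo "$out"
```

`split` reports a single piece, and that piece is the whole string, so the only key stored in `d` is `BRCA1,BRCA2`, which never equals `BRCA2`.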
So you need to split the string again, this time on `,`, and hash each piece. Like:
n=split($5,v,",")
for (i=1; i <= n; i++) {
d[v[i]] = $4
}
Now `d` is hashed with both `d[BRCA1]` and `d[BRCA2]`. Use the above together with your existing code.
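To confirm which keys end up in `d`, here is a small sketch using the same field value and the code 81 from the sample row (the names `a`, `b` and `out` are only for this illustration):

```shell
out=$(printf 'BRCA 1, BRCA2\n' | awk '{
    gsub(" ", "", $0)                        # BRCA1,BRCA2
    n = split($0, v, ",")                    # v[1]=BRCA1, v[2]=BRCA2
    for (i = 1; i <= n; i++) d[v[i]] = 81    # hash every piece, not just v[1]
    a = ("BRCA1" in d)                       # 1 if the key exists, else 0
    b = ("BRCA2" in d)
    print a, b, d["BRCA2"]
}')
echo "$out"
```

Both membership tests succeed, so a lookup for `BRCA2` from `file1` now finds 81.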
Alternatively, remove the code
gsub(" ","",$5)
n=split($5,v,"/")
d[v[1]] = $4
completely and use this instead:
gsub(" ","",$5)
n=split($5,v,"\\||,")
for (i=1; i <= n; i++) {
d[v[i]] = $4
}
This splits `$5` on `|` or `,`, then loops over the pieces and hashes each one into the array `d`.
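Putting it together, here is a sketch of the whole script run against a cut-down copy of the sample data. Several things here are assumptions on my part: `file2` is taken to be tab-delimited with the code in `$4`, the gene in `$5` and the methodology in `$7` (as the original command implies), the column boundaries and header in the rows below are my reading of the flattened listing, and the files are written to a temporary directory only so the example is self-contained. I have also used the character class `[/,]` instead of `"\\||,"`, since the sample data separates `BCR/ABL 1` with `/` rather than `|`.

```shell
tmp=$(mktemp -d)

# file1: one gene name per line
printf '%s\n' BRCA2 BCR SCN1A fbn1 > "$tmp/file1"

# file2: tab-delimited; header text and column boundaries are assumed
{
  printf 'Tier\texplanation\tList\tcode\tgene\tgene name\tmethodology\tdisease\n'
  printf 'Tier 1\t.\tJan-16\t81\tBRCA 1, BRCA2\tbreast cancer 1 and 2\tfull gene sequence and full deletion/duplication analysis\thereditary breast and ovarian cancer\n'
  printf 'Tier 1\t.\t.\t806\tBCR/ABL 1\tt(9;22)\tmajor breakpoint, qualitative or quantitative\tchronic myelogenous leukemia CML\n'
  printf 'Tier 1\t.\tJan-16\t85\tFBN1\tFibrillin\tfull gene sequencing\theart disease\n'
  printf 'Tier 1\t.\tJan-16\t95\tFBN1\tfibrillin\tdel/dup\theart disease\n'
} > "$tmp/file2"

out=$(awk 'BEGIN { FS = OFS = "\t" }
{ $0 = toupper($0) }                          # uppercase every line of both files
FNR == NR {                                   # first file on the command line: file2
    if (FNR > 1 && $7 ~ /FULL GENE SEQUENC/) { # skip header, partial match on $7
        gsub(" ", "", $5)                     # BRCA 1, BRCA2 becomes BRCA1,BRCA2
        n = split($5, v, "[/,]")              # split on "/" or ","
        for (i = 1; i <= n; i++) d[v[i]] = $4 # hash every gene to its code
    }
    next
}
{ print $1, ($1 in d ? d[$1] : 279) }         # file1: code if matched, else 279
' "$tmp/file2" "$tmp/file1")

echo "$out"
rm -rf "$tmp"
```

On this input the script prints `BRCA2 81`, `BCR 279`, `SCN1A 279` and `FBN1 85`, tab-separated, which matches the desired output. It also shows that `$7` was never the problem: `~` is a regex match, so `FULL GENE SEQUENC` matches anywhere inside the longer methodology string.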