How to extract part of a substring from a column in a file using the shell

Time: 2017-03-13 11:52:08

Tags: bash shell text-extraction

I have a question.

I have a file named variants.txt with the following contents:

select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42132048_42132049insT";
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42132048_42132049insTT";
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42132048_42132049delTT";
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42131884_42131885insT";
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42131540_42131541delTC";
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42131420T>C";
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42131222G>A";
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42131145T>C";
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42131125C>G";
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42131122A>C";
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42131119G>A";
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42131118T>C";
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42131112G>C";
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42131111T>C";
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42131067G>A";
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42131066G>A";
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42131063G>A";
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42131059C>T";
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42131058C>G";
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42131023C>G";
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42131016T>C";
select chrom,chromStart,chromEnd,name from snp147 where name="rs138100349   ";
select chrom,chromStart,chromEnd,name from snp147 where name="rs118203758 ";

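I want to check whether the last column (the one starting with name=) contains the substring g. and, if it does, print everything between g. and the trailing "; to another file.

For example, given: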

select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42132048_42132049insT";
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42131125C>G";

I want:

42132048_42132049insT
42131125C>G

How can I do this?

3 answers:

Answer 0: (score: 2)

Try:

awk '{num=sub(/.*:g\./,"");num+=sub(/\".*/,"");if(num==2){print};num=""}'  Input_file
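
As a reading aid, here is a sketch of the same idea spread over several lines with comments. The file names sample_variants.txt and answer0_out.txt are made up for illustration, and the two sample lines are copied from the question.

```shell
# Hypothetical sample: one line with a g. name and one without
cat > sample_variants.txt <<'EOF'
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42132048_42132049insT";
select chrom,chromStart,chromEnd,name from snp147 where name="rs138100349   ";
EOF

awk '{
    n  = sub(/.*:g\./, "")   # drop everything up to and including ":g."; sub returns 1 on a match
    n += sub(/".*/, "")      # drop everything from the closing double quote to the end of the line
    if (n == 2) print        # print only when both substitutions matched
}' sample_variants.txt > answer0_out.txt

cat answer0_out.txt
```

The rs line is skipped because only the second substitution matches it, so the counter stays at 1.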

Answer 1: (score: 1)

Choosing the input field separator regex carefully (via -F) gives a simple solution:

awk -F':g\.|";' 'NF>2 {print $2}' file
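  • The regular expression :g\.|"; splits each input line into fields at either a literal :g. or (|) a literal ";. Lines of interest are therefore split into (at least) 3 fields, and the substring to extract is the 2nd field ($2).

  • NF>2 matches only lines with at least 3 fields (NF is the number of fields), which ensures that lines without the substring of interest are ignored.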
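
To see the field counts described above, the following sketch prints NF for two sample lines from the question and then writes the extracted value to a separate file, as the question asks. The file names fs_sample.txt and extracted.txt are hypothetical. The dot is written as [.] rather than \. here; that is my variation, chosen so the result does not depend on how a given awk treats backslashes in the -F value.

```shell
# Hypothetical sample: one line with a g. name and one without
cat > fs_sample.txt <<'EOF'
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42131125C>G";
select chrom,chromStart,chromEnd,name from snp147 where name="rs118203758 ";
EOF

# Number of fields per line when splitting on ":g." or on the trailing quote-semicolon
awk -F':g[.]|";' '{print NF}' fs_sample.txt > fs_counts.txt
cat fs_counts.txt

# Keep only lines with more than 2 fields and write the 2nd field to another file
awk -F':g[.]|";' 'NF>2 {print $2}' fs_sample.txt > extracted.txt
cat extracted.txt
```

The first line splits into 3 fields (the last one empty), the second into 2, so only the first passes the NF>2 test.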

Answer 2: (score: 0)

You can do this with awk, grep, and sed:

awk -F'name=' '{print $2}' variants.txt | grep -F 'g.' | awk -F'g[.]' '{print $2}' | sed -e 's/";//g'

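That is:

  1. Take the string from "name=" to the end of each line in the original file.

  2. Keep only the lines that contain "g.".

  3. Take everything from "g." to the end.

  4. Remove the trailing " and ; characters to get the output shown in the example.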
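
The four steps can be sketched as a pipeline with one command per step. This is an illustration under assumptions: the file names steps_sample.txt and steps_out.txt are made up, and using grep -F for step 2 and the bracket expression g[.] to match a literal dot in step 3 are my choices.

```shell
# Hypothetical sample: one line with a g. name and one without
cat > steps_sample.txt <<'EOF'
select chrom,chromStart,chromEnd,name from snp147 where name="NC_000022.11:g.42131540_42131541delTC";
select chrom,chromStart,chromEnd,name from snp147 where name="rs118203758 ";
EOF

awk -F'name=' '{print $2}' steps_sample.txt |   # step 1: text after "name="
  grep -F 'g.' |                                # step 2: keep lines containing a literal "g."
  awk -F'g[.]' '{print $2}' |                   # step 3: text after "g."
  sed -e 's/";//g' > steps_out.txt              # step 4: remove the trailing quote and semicolon

cat steps_out.txt
```

Without the filtering step, the rs line would come through as an empty line in the output.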