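Searching a .txt file for specific lines and outputting them to a new file

Time: 2017-06-21 11:05:48

Tags: linux shell awk grep

I have a very large .txt file, split into columns that are separated by (I think) tabs.

The columns look like this (obtained by running head NAMEOFTEXTFILE.txt):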

CANVAS_CHROM    CANVAS_START    CANVAS_END      CANVAS_GT       CANVAS_RC       CANVAS_BC       CANVAS_CN       CANVAS_FILTER  CANVAS_QUAL      BRIDGE_ID       ILMN_ID PROJECT InferredSex     ensembl_gene_IDs        external_gene_IDs       Gene_database  Gene_biotype     overlap_ctrl    overlap_internal        overlap_PAR

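I want to filter this by Bridge ID. Each ID is a three-letter acronym for the disease, e.g. IDM for "insulin-dependent diabetes mellitus", followed by the serial number of that particular patient, so IDM131289748937 would be the genetic data for that patient. Each row in the table represents a different mutation.

I then want to output the result to a .txt file.

So far (using the command line) I have tried: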

grep "IDM" $(find .. -name 'NAMEOFTEXTFILE.txt') > filtereddata.txt

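But this outputs a garbled list.

I have also tried:

awk '/IDM3/' NAMEOFTEXTFILE.txt > filtereddata.txt

That didn't work either.

I would like to know which tool is best suited to this task.

I have attached a sample of the original text file: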

  

CANVAS_CHROM CANVAS_START CANVAS_END CANVAS_GT CANVAS_RC CANVAS_BC CANVAS_CN CANVAS_FILTER CANVAS_QUAL BRIDGE_ID ILMN_ID PROJECT InferredSex ensembl_gene_IDs external_gene_IDs Gene_database Gene_biotype overlap_ctrl overlap_internal overlap_PAR
1 825226 916134 0/1 145 87 3 . 53 M006429 LP2000749-DNA_C03 PMG F ENSG00000272438,ENSG00000230699,ENSG00000241180,ENSG00000223764,ENSG00000187634,ENSG00000268179,ENSG00000188976,ENSG00000187961,ENSG00000187583,ENSG00000187642 RP11-54O7.16,RP11-54O7.1,RP11-54O7.2,RP11-54O7.3,SAMD11,AL645608.1,NOC2L,KLHL17,PLEKHN1,C1orf170 Clone-based (Vega),Clone-based (Vega),Clone-based (Vega),Clone-based (Vega),HGNC Symbol,Clone-based (Ensembl),HGNC Symbol,HGNC Symbol,HGNC Symbol,HGNC Symbol lincRNA,lincRNA,lincRNA,lincRNA,protein_coding,protein_coding,protein_coding,protein_coding,protein_coding,protein_coding 0 1 0
1 826236 3641787 0/1 126 2655 3 . 61 E009248 LP2000862-DNA_G05 PAH F ENSG00000272438,ENSG00000230699,ENSG00000241180,ENSG00000223764,ENSG00000187634,ENSG00000268179,ENSG00000188976,ENSG00000187961,ENSG00000187583,ENSG00000187642,ENSG00000272512,ENSG00000188290,ENSG00000231702,ENSG00000224969,ENSG00000187608,ENSG00000188157,ENSG00000242590,ENSG00000217801,ENSG00000273443,ENSG00000237330,ENSG00000131591,ENSG00000223823,ENSG00000207730,ENSG00000207607,ENSG00000198976,ENSG00000272141,ENSG00000205231,ENSG00000162571,ENSG00000186891,ENSG00000186827,ENSG00000078808,ENSG00000176022,ENSG00000184163,ENSG00000260179,ENSG00000160087,ENSG00000230415,ENSG00000162572,ENSG00000131584,ENSG00000169972,ENSG00000127054,ENSG00000240731,ENSG00000224051,ENSG00000169962,ENSG00000107404,ENSG00000162576,ENSG00000175756,ENSG00000223663,ENSG00000221978,ENSG00000224870,ENSG00000242485,ENSG00000264293,ENSG00000272455,ENSG00000235098,ENSG00000225905,ENSG00000205116,ENSG00000225285,ENSG00000179403,ENSG00000215915,ENSG00000160072,ENSG00000197785,ENSG00000205090,ENSG00000160075,ENSG00000215791,ENSG00000215014,ENSG00000236684,ENSG00000228594,ENSG00000272106,ENSG00000197530,ENSG00000189409,ENSG00000248333,ENSG00000272004,ENSG00000189339,ENSG00000269737,ENSG00000269227,ENSG00000215914,ENSG00000008128,ENSG00000268575,ENSG00000227775,ENSG00000215790,ENSG00000008130,ENSG00000078369,ENSG00000231050,ENSG00000169885,ENSG00000178821,ENSG00000142609,ENSG00000233542,ENSG00000187730,ENSG00000226969,ENSG00000067606,ENSG00000271806,ENSG00000182873,ENSG00000162585,ENSG00000269554,ENSG00000203301,ENSG00000243558,ENSG00000234396,ENSG00000157933,ENSG00000116151,ENSG00000272161,ENSG00000269753,ENSG00000269896,ENSG00000238240,ENSG00000272420,ENSG00000271921,ENSG00000271847,ENSG00000178642,ENSG00000157916,ENSG00000157911,ENSG00000149527,ENSG00000224387,ENSG00000229393,ENSG00000157881,ENSG00000197921,ENSG00000272449,ENSG00000238164,ENSG00000157873,ENSG00000225931,ENSG00000228037,ENSG00000157870,ENSG00000142606,ENSG00000237058,ENSG00000215912,ENSG00000233234,ENSG00000231630,ENSG00000169717,ENSG00000177133,ENSG00000256761,ENSG00000142611,ENSG00000226286,ENSG00000272235,ENSG00000130762,ENSG00000272088,ENSG00000162591,ENSG00000207776,ENSG00000238260,ENSG00000158109,ENSG00000116213,ENSG00000078900,ENSG00000227589,ENSG00000235131 RP11-54O7.16,RP11-54O7.1,RP11-54O7.2,RP11-54O7.3,SAMD11,AL645608.1,NOC2L,KLHL17,PLEKHN1,C1orf170,RP11-54O7.17,HES4,RP11-54O7.10,RP11-54O7.11,ISG15,AGRN,RP11-54O7.14,RP11-465B22.3,RP11-54O7.18,RNF223,C1orf159,RP11-465B22.5,MIR200B,MIR200A,MIR429,RP11-465B22.8,TTLL10-AS1,TTLL10,TNFRSF18,TNFRSF4,SDF4,B3GALT6,FAM132A,RP5-902P8.12,UBE2J2,RP5-902P8.10,SCNN1D,ACAP3,PUSL1,CPSF3L,RP5-890O3.9,GLTPD1,TAS1R3,DVL1,MXRA8,AURKAIP1,RP5-890O3.3,CCNL2,RP4-758J18.2,MRPL20,RN7SL657P,RP4-758J18.13,ANKRD65,RP4-758J18.7,TMEM88B,RP4-758J18.10,VWA1,ATAD3C,ATAD3B,ATAD3A,TMEM240,SSU72,AL645728.2,AL645728.1,AL645728.3,C1orf233,RP11-345P4.9,MIB2,MMP23B,CDK11B,RP11-345P4.10,SLC35E2B,RP11-345P4.7,RP11-345P4.6,MMP23A,CDK11A,RP1-283E3.8,RP1-283E3.4,SLC35E2,NADK,GNB1,RP1-140A9.1,CALML6,TMEM52,C1orf222,RP11-547D24.1,GABRD,RP11-547D24.3,PRKCZ,RP5-892K4.1,RP11-181G12.2,C1orf86,AL590822.2,AL590822.1,RP11-181G12.5,RP11-181G12.4,SKI,MORN1,RP4-713A8.1,AL589739.1,RP4-740C4.6,RP4-740C4.5,RP4-740C4.7,RP4-740C4.9,RP4-740C4.8,AL513477.1,RER1,PEX10,PLCH2,RP3-395M20.2,RP3-395M20.3,PANK4,HES5,RP3-395M20.12,RP3-395M20.8,TNFRSF14,RP3-395M20.7,RP3-395M20.9,FAM213B,MMEL1,RP13-436F16.1,TTC34,RP11-740P5.2,RP11-740P5.3,ACTRT2,LINC00982,AL008733.1,PRDM16,RP1-163G9.2,RP11-22L13.1,ARHGEF16,RP11-168F9.2,MEGF6,MIR551A,RP11-46F15.2,TPRG1L,WRAP73,TP73,RP5-1092A11.5,RP5-1092A11.2 Clone-based (Vega),Clone-based (Vega),Clone-based (Vega),Clone-based (Vega),HGNC Symbol,Clone-based (Ensembl),HGNC Symbol,HGNC Symbol,HGNC Symbol,HGNC Symbol,Clone-based (Vega),HGNC Symbol,Clone-based (Vega),Clone-based (Vega),HGNC Symbol,HGNC Symbol,Clone-based (Vega),Clone-based (Vega),Clone-based (Vega),HGNC Symbol,HGNC Symbol,Clone-based (Vega),HGNC Symbol,HGNC Symbol,HGNC Symbol,Clone-based (Vega),HGNC Symbol,HGNC Symbol,HGNC Symbol,HGNC Symbol,HGNC Symbol,HGNC Symbol,HGNC Symbol,Clone-based (Vega),HGNC Symbol,Clone-based (Vega),HGNC Symbol,HGNC Symbol,HGNC Symbol,HGNC Symbol,Clone-based (Vega),HGNC Symbol,HGNC Symbol,HGNC Symbol,HGNC Symbol,HGNC Symbol,Clone-based (Vega),HGNC Symbol,Clone-based (Vega),HGNC Symbol,HGNC Symbol,Clone-based (Vega),HGNC Symbol,Clone-based (Vega),HGNC Symbol,Clone-based (Vega),HGNC Symbol,HGNC Symbol,HGNC Symbol,HGNC Symbol,HGNC Symbol,HGNC Symbol,Clone-based (Ensembl),Clone-based (Ensembl),Clone-based (Ensembl),HGNC Symbol,Clone-based (Vega),HGNC Symbol,HGNC Symbol,HGNC Symbol,Clone-based (Vega),HGNC Symbol,Clone-based (Vega),Clone-based (Vega),HGNC Symbol,HGNC Symbol,Clone-based (Vega),Clone-based (Vega),HGNC Symbol,HGNC Symbol,HGNC Symbol,Clone-based (Vega),HGNC Symbol,HGNC Symbol,HGNC Symbol,Clone-based (Vega),HGNC Symbol,Clone-based (Vega),HGNC Symbol,Clone-based (Vega),Clone-based (Vega),HGNC Symbol,Clone-based (Ensembl),Clone-based (Ensembl),Clone-based (Vega),Clone-based (Vega),HGNC Symbol,HGNC Symbol,Clone-based (Vega),Clone-based (Ensembl),Clone-based (Vega),Clone-based (Vega),Clone-based (Vega),Clone-based (Vega),Clone-based (Vega),Clone-based (Ensembl),HGNC Symbol,HGNC Symbol,HGNC Symbol,Clone-based (Vega),Clone-based (Vega),HGNC Symbol,HGNC Symbol,Clone-based (Vega),Clone-based (Vega),HGNC Symbol,Clone-based (Vega),Clone-based (Vega),HGNC Symbol,HGNC Symbol,Clone-based (Vega),HGNC Symbol,Clone-based (Vega),Clone-based (Vega),HGNC Symbol,HGNC Symbol,Clone-based (Ensembl),HGNC Symbol,Clone-based (Vega),Clone-based (Vega),HGNC Symbol,Clone-based (Vega),HGNC Symbol,HGNC Symbol,Clone-based (Vega),HGNC Symbol,HGNC Symbol,HGNC Symbol,Clone-based (Vega),Clone-based (Vega) lincRNA,lincRNA,lincRNA,lincRNA,protein_coding,protein_coding,protein_coding,protein_coding,protein_coding,protein_coding,lincRNA,protein_coding,pseudogene,antisense,protein_coding,protein_coding,sense_intronic,pseudogene,lincRNA,protein_coding,protein_coding,lincRNA,miRNA,miRNA,miRNA,lincRNA,antisense,protein_coding,protein_coding,protein_coding,protein_coding,protein_coding,protein_coding,lincRNA,protein_coding,lincRNA,protein_coding,protein_coding,protein_coding,protein_coding,sense_intronic,protein_coding,protein_coding,protein_coding,protein_coding,protein_coding,pseudogene,protein_coding,protein_coding,protein_coding,misc_RNA,lincRNA,protein_coding,antisense,protein_coding,lincRNA,protein_coding,protein_coding,protein_coding,protein_coding,protein_coding,protein_coding,pseudogene,protein_coding,pseudogene,protein_coding,antisense,protein_coding,protein_coding,protein_coding,antisense,protein_coding,antisense,pseudogene,pseudogene,protein_coding,processed_transcript,pseudogene,protein_coding,protein_coding,protein_coding,antisense,protein_coding,protein_coding,protein_coding,antisense,protein_coding,antisense,protein_coding,antisense,antisense,protein_coding,protein_coding,protein_coding,lincRNA,lincRNA,protein_coding,protein_coding,sense_intronic,protein_coding,processed_transcript,pseudogene,sense_intronic,sense_intronic,sense_intronic,pseudogene,protein_coding,protein_coding,protein_coding,antisense,antisense,protein_coding,protein_coding,lincRNA,processed_transcript,protein_coding,antisense,antisense,protein_coding,protein_coding,antisense,protein_coding,lincRNA,lincRNA,protein_coding,antisense,pseudogene,protein_coding,antisense,lincRNA,protein_coding,lincRNA,protein_coding,miRNA,antisense,protein_coding,protein_coding,protein_coding,antisense,antisense 0 1 0
1 969935 1231975 0/1 145 252 3 . 61 E005981 LP2000742-DNA_D01 PAH M ENSG00000188157,ENSG00000242590,ENSG00000217801,ENSG00000273443,ENSG00000237330,ENSG00000131591,ENSG00000223823,ENSG00000207730,ENSG00000207607,ENSG00000198976,ENSG00000272141,ENSG00000205231,ENSG00000162571,ENSG00000186891,ENSG00000186827,ENSG00000078808,ENSG00000176022,ENSG00000184163,ENSG00000260179,ENSG00000160087,ENSG00000230415,ENSG00000162572,ENSG00000131584 AGRN,RP11-54O7.14,RP11-465B22.3,RP11-54O7.18,RNF223,C1orf159,RP11-465B22.5,MIR200B,MIR200A,MIR429,RP11-465B22.8,TTLL10-AS1,TTLL10,TNFRSF18,TNFRSF4,SDF4,B3GALT6,FAM132A,RP5-902P8.12,UBE2J2,RP5-902P8.10,SCNN1D,ACAP3 HGNC Symbol,Clone-based (Vega),Clone-based (Vega),Clone-based (Vega),HGNC Symbol,HGNC Symbol,Clone-based (Vega),HGNC Symbol,HGNC Symbol,HGNC Symbol,Clone-based (Vega),HGNC Symbol,HGNC Symbol,HGNC Symbol,HGNC Symbol,HGNC Symbol,HGNC Symbol,HGNC Symbol,Clone-based (Vega),HGNC Symbol,Clone-based (Vega),HGNC Symbol,HGNC Symbol protein_coding,sense_intronic,pseudogene,lincRNA,protein_coding,protein_coding,lincRNA,miRNA,miRNA,miRNA,lincRNA,antisense,protein_coding,protein_coding,protein_coding,protein_coding,protein_coding,protein_coding,lincRNA,protein_coding,lincRNA,protein_coding,protein_coding 0 2 0
1 1025358 1068256 0/1 141 43 3 . 25 G012138 LP2000955-DNA_A12 SPEED M ENSG00000131591 C1orf159 HGNC Symbol protein_coding 0 4 0
1 1027213 1054981 0/1 122 31 3 . 17 C003646 LP2000719-DNA_D01 GEL F ENSG00000131591 C1orf159 HGNC Symbol protein_coding 0 6 0
1 1027429 1054789 0/1 120 30 3 . 17 C003121 LP2000712-DNA_D08 GEL F ENSG00000131591 C1orf159 HGNC Symbol protein_coding 0 6 0
1 1027747 1054977 0/1 127 27 3 . 15 C001669 LP2000262-DNA_B10 GEL F ENSG00000131591 C1orf159 HGNC Symbol protein_coding 0 6 0
1 1028234 1047162 0/1 116 21 3 . 11 C002886 LP2000275-DNA_C06 GEL M ENSG00000131591 C1orf159 HGNC Symbol protein_coding 0 6 0
1 1028342 1046413 0/1 122 20 3 . 11 C001874 LP2000266-DNA_H03 GEL F ENSG00000131591 C1orf159 HGNC Symbol protein_coding 0 6 0

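2 answers:

Answer 0: (score: 0)

When you use a plain grep "IDM", it matches the string "IDM" anywhere on a line (at the beginning, in the middle of a word, and so on) and prints every line that contains it.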
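To illustrate the difference, here is a minimal sketch on a made-up three-line file. The file names and the IDs in it are hypothetical and only serve to show how an unanchored pattern compares with one anchored to the start of the line.

```shell
# Hypothetical sample data: only the first line has an ID that starts with IDM;
# the second line merely mentions "IDM" in another column.
printf 'IDM0000001\tfirst\nE009248\tmentions IDM here\nE005981\tthird\n' > sample.txt

# Unanchored: matches "IDM" anywhere, so lines 1 and 2 are kept.
grep "IDM" sample.txt > anywhere.txt

# Anchored with ^: only lines that begin with "IDM" are kept (line 1).
grep "^IDM" sample.txt > at_start.txt
```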

Since I'm not sure exactly where the text "IDM31234234" appears on a line, I'm assuming that the line starts with "IDM31234234".

grep -E "^IDM[0-9]+ " inputfile > outputfile # matches lines that start with IDM, followed by at least one digit and then a space

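If you know how many digits the ID has, you can do it like this:

 grep -E "^IDM[0-9]{7} " inputfile > outputfile # IDM followed by exactly 7 digits
 grep -E "^IDM[0-9]{7,} " inputfile > outputfile # IDM followed by at least 7 digits

In short, a regular expression lets you narrow the search much more precisely. Hope this helps.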
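Note that in the header shown in the question, BRIDGE_ID is the tenth column rather than the first, so a pattern anchored to the start of the line may not match there. A minimal sketch of a column-based filter with awk follows, assuming the file really is tab-separated; the sample rows, the IDs, and the file name sample.tsv are made up for illustration.

```shell
# Hypothetical sample: the first ten header columns from the question
# plus three invented data rows.
{
  printf 'CANVAS_CHROM\tCANVAS_START\tCANVAS_END\tCANVAS_GT\tCANVAS_RC\tCANVAS_BC\tCANVAS_CN\tCANVAS_FILTER\tCANVAS_QUAL\tBRIDGE_ID\n'
  printf '1\t100\t200\t0/1\t145\t87\t3\t.\t53\tIDM0000001\n'
  printf '1\t300\t400\t0/1\t126\t26\t3\t.\t61\tE0000002\n'
  printf '1\t500\t600\t0/1\t120\t30\t3\t.\t17\tIDM0000003\n'
} > sample.tsv

# Keep the header (NR == 1) and every row whose 10th tab-separated
# field starts with "IDM".
awk -F'\t' 'NR == 1 || $10 ~ /^IDM/' sample.tsv > filtereddata.txt
```

Testing only field 10 avoids false hits when "IDM" happens to appear in another column, such as a gene name. If the columns turn out to be separated by spaces instead, the -F option can be dropped so that awk splits on any whitespace, although fields that themselves contain spaces would then be split as well.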

Answer 1: (score: 0)

Split the file into one word per line, then grep for IDM:

tr " " "\n" < NAMEOFTEXTFILE.txt | grep IDM > OUTPUTFILE.txt

Use "\t" if the columns are separated by tabs, or " " if they are separated by spaces.
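As a sketch of the tab-separated variant, assuming a small made-up file (the name sample_ids.tsv and the IDs are hypothetical): this approach prints only the matching fields, not the whole rows, so it is mainly useful for listing which IDs occur in the file.

```shell
# Hypothetical sample: three tab-separated rows, two of them with the same ID.
printf 'a\tIDM0000001\tx\nb\tE0000002\ty\nc\tIDM0000001\tz\n' > sample_ids.tsv

# Put every field on its own line, keep the fields that start with IDM,
# and collapse duplicates.
tr '\t' '\n' < sample_ids.tsv | grep '^IDM' | sort -u > ids.txt
```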