I have an input file like this, in which KO IDs such as K00001, K00002 and K00006 are listed under categories such as 01100Metabolicpathways(812), 01523Antifolateresistance(7), 01522Endocrineresistance(7), and so on:
01100Metabolicpathways(812)
K00001
Ca_19344,Ca_19730
K00002
Ca_09433,Ca_23715,Ca_15858,Ca_19929,Ca_26670
K00008
Ca_20904
K00011
Ca_15431
K00012
Ca_10466,Ca_23867,Ca_06574
K00013
Ca_08009
K00016
Ca_02357,Ca_16304
K00020
Ca_08005
K00021
Ca_10251,Ca_09868
01523Antifolateresistance(7)
K00297
Ca_26773
K00600
Ca_00054,Ca_00455,Ca_14951,Ca_11397,Ca_08538,Ca_11540,Ca_11173
01522Endocrineresistance(7)
K04650
Ca_20380,Ca_04277
The expected output looks like this:
K00001 Ca_19344,Ca_19730 01100Metabolicpathways(812)
K00002 Ca_09433,Ca_23715,Ca_15858,Ca_19929,Ca_26670 01100Metabolicpathways(812)
K00006 Ca_14695,Ca_21671,Ca_07219,Ca_24024,Ca_23566,Ca_27084 01100Metabolicpathways(812)
K00008 Ca_20904 01100Metabolicpathways(812)
K00011 Ca_15431 01100Metabolicpathways(812)
K00012 Ca_10466,Ca_23867,Ca_06574 01100Metabolicpathways(812)
K00013 Ca_08009 01100Metabolicpathways(812)
K00016 Ca_02357,Ca_16304 01100Metabolicpathways(812)
K00020 Ca_08005 01100Metabolicpathways(812)
K00021 Ca_10251,Ca_09868 01100Metabolicpathways(812)
K00297 Ca_26773 01523Antifolateresistance(7)
K00600 Ca_00054,Ca_00455,Ca_14951,Ca_11397,Ca_08538,Ca_11540,Ca_11173 01523Antifolateresistance(7)
K04650 Ca_20380,Ca_04277 01522Endocrineresistance(7)
I tried this by putting the KO IDs into a KO_list.txt file (listed first) and looping over them with the script that follows:
K00001
K00002
K00006
K00008
K00011
K00012
K00013
K00016
K00020
K00021
for n in `cat KO_list.txt`
do
x=$(cat $2 | grep -w -A1 "^$n" | head -2 | sed ':a;N;$!ba;s/\n/\t/g')
echo -e "$x" | awk 'NF' >> output.txt
done
But it only gives me output like this, without the category column:
K00001 Ca_19344,Ca_19730
K00002 Ca_09433,Ca_23715,Ca_15858,Ca_19929,Ca_26670
K00006 Ca_14695,Ca_21671,Ca_07219,Ca_24024,Ca_23566,Ca_27084
K00008 Ca_20904
K00011 Ca_15431
K00012 Ca_10466,Ca_23867,Ca_06574
K00013 Ca_08009
K00016 Ca_02357,Ca_16304
K00020 Ca_08005
K00021 Ca_10251,Ca_09868
K00297 Ca_26773
K00600 Ca_00054,Ca_00455,Ca_14951,Ca_11397,Ca_08538,Ca_11540,Ca_11173
K04650 Ca_20380,Ca_04277
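Any help would be appreciated.
######### Part two
Thank you all, I really appreciate all your valuable comments. There is also a second part: is there any way to produce the output by Ca ID instead, with the Ca ID in the first column and each Ca ID on its own line with its information, rather than having the KO ID in the first column? I am looking for output like this from either of the following files.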
From this input file:
01100Metabolicpathways(812)
K00001
Ca_19344,Ca_19730
01522Endocrineresistance(7)
K04650
Ca_20380,Ca_04277
or from the output file created from that input file:
K00001 Ca_19344,Ca_19730 01100Metabolicpathways(812)
K04650 Ca_20380,Ca_04277 01522Endocrineresistance(7)
The new desired output should look like this:
Ca_19344 K00001 01100Metabolicpathways(812)
Ca_19730 K00001 01100Metabolicpathways(812)
Ca_20380 K04650 01522Endocrineresistance(7)
Ca_04277 K04650 01522Endocrineresistance(7)
Thanks in advance.
Answer 0 (score: 0)
This is a cheap solution, but it works for your input:
awk '/^[0-9].*/{ h = $0; next }/^K/{ k = $0; next }{ print k, $0, h }' yourfile
Output:
K00001 Ca_19344,Ca_19730 01100Metabolicpathways(812)
K00002 Ca_09433,Ca_23715,Ca_15858,Ca_19929,Ca_26670 01100Metabolicpathways(812)
K00008 Ca_20904 01100Metabolicpathways(812)
K00011 Ca_15431 01100Metabolicpathways(812)
K00012 Ca_10466,Ca_23867,Ca_06574 01100Metabolicpathways(812)
K00013 Ca_08009 01100Metabolicpathways(812)
K00016 Ca_02357,Ca_16304 01100Metabolicpathways(812)
K00020 Ca_08005 01100Metabolicpathways(812)
K00021 Ca_10251,Ca_09868 01100Metabolicpathways(812)
K00297 Ca_26773 01523Antifolateresistance(7)
K00600 Ca_00054,Ca_00455,Ca_14951,Ca_11397,Ca_08538,Ca_11540,Ca_11173 01523Antifolateresistance(7)
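What this one-liner basically does is capture certain lines into variables. The first part, /^[0-9].*/{ h = $0; next }, matches lines that start with a digit, such as the category identifiers. awk sees the line 01100Metabolicpathways(812) and stores it in the variable h. The next command then runs and awk reads the next line. The second part, /^K/{ k = $0; next }, runs when a line starts with K, as your KO IDs do. Again awk stores the whole line in a variable and moves on to the next line. When a line meets neither criterion (it starts with neither K nor a digit), the last part, { print k, $0, h }, runs. It prints the contents of k, the whole current line, and the contents of h, which gives the desired output.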
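The three rules described above can be tried on a small sample without creating a file. This is only a minimal sketch: the function name ko_join is made up for illustration, and the last rule is restricted to lines starting with Ca_ (a variation on the catch-all rule in the one-liner), so that stray blank lines are not printed.

```shell
# ko_join: hypothetical helper; reads the category / KO ID / gene-list format on stdin
ko_join() {
  awk '
    /^[0-9]/ { h = $0; next }    # category line: remember it in h
    /^K/     { k = $0; next }    # KO ID line: remember it in k
    /^Ca_/   { print k, $0, h }  # gene list: print KO ID, genes, category
  '
}

printf '%s\n' \
  '01100Metabolicpathways(812)' \
  'K00001' \
  'Ca_19344,Ca_19730' \
  '01522Endocrineresistance(7)' \
  'K04650' \
  'Ca_20380,Ca_04277' | ko_join
```

On this sample it prints one line per KO ID, with the category that was most recently seen as the third column.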
Answer 1 (score: 0)
awk -v OFS=, '/^K/{k=$0;next}/^C/{ print k, $0 " " c; k=""; next}{c=$0}' infile
Explanation
awk -v OFS=, ' # set output field separator
/^K/{ # if line starts with K
k=$0; # variable k = current line/record
next # go to next line
}
/^C/{ # if line starts with C
print k, $0 " " c; # print variable k, current row and variable c
k=""; # nullify variable k
next # go to next line
}
{
c=$0 # any other line was not skipped above,
# so store it in c as the category
}
' infile
Input
$ cat infile
01100Metabolicpathways(812)
K00001
Ca_19344,Ca_19730
K00002
Ca_09433,Ca_23715,Ca_15858,Ca_19929,Ca_26670
K00008
Ca_20904
K00011
Ca_15431
K00012
Ca_10466,Ca_23867,Ca_06574
K00013
Ca_08009
K00016
Ca_02357,Ca_16304
K00020
Ca_08005
K00021
Ca_10251,Ca_09868
01523Antifolateresistance(7)
K00297
Ca_26773
K00600
Ca_00054,Ca_00455,Ca_14951,Ca_11397,Ca_08538,Ca_11540,Ca_11173
01522Endocrineresistance(7)
K04650
Ca_20380,Ca_04277
Output
$ awk -v OFS=, '/^K/{k=$0;next}/^C/{ print k,$0" "c ; k=""; next}{c=$0}' infile
K00001,Ca_19344,Ca_19730 01100Metabolicpathways(812)
K00002,Ca_09433,Ca_23715,Ca_15858,Ca_19929,Ca_26670 01100Metabolicpathways(812)
K00008,Ca_20904 01100Metabolicpathways(812)
K00011,Ca_15431 01100Metabolicpathways(812)
K00012,Ca_10466,Ca_23867,Ca_06574 01100Metabolicpathways(812)
K00013,Ca_08009 01100Metabolicpathways(812)
K00016,Ca_02357,Ca_16304 01100Metabolicpathways(812)
K00020,Ca_08005 01100Metabolicpathways(812)
K00021,Ca_10251,Ca_09868 01100Metabolicpathways(812)
K00297,Ca_26773 01523Antifolateresistance(7)
K00600,Ca_00054,Ca_00455,Ca_14951,Ca_11397,Ca_08538,Ca_11540,Ca_11173 01523Antifolateresistance(7)
K04650,Ca_20380,Ca_04277 01522Endocrineresistance(7)
- Edit -
For new input
$ cat infile_new
K00001 Ca_19344,Ca_19730 01100Metabolicpathways(812)
K04650 Ca_20380,Ca_04277 01522Endocrineresistance(7)
$ awk '{split($2,a,/,/); for(i=1; i in a; i++)print a[i], $1, $3}' infile_new
Ca_19344 K00001 01100Metabolicpathways(812)
Ca_19730 K00001 01100Metabolicpathways(812)
Ca_20380 K04650 01522Endocrineresistance(7)
Ca_04277 K04650 01522Endocrineresistance(7)
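The loop above stops when `i in a` becomes false. split also returns the number of pieces, which can drive the loop directly. A minimal sketch, assuming the input has exactly three whitespace-separated columns as in infile_new; the function name ca_rows and the array name genes are made up for illustration.

```shell
# ca_rows: hypothetical helper; expects "KO_ID  Ca_list  category" lines on stdin
ca_rows() {
  awk '{
    n = split($2, genes, ",")     # n = number of Ca IDs in column 2
    for (i = 1; i <= n; i++)
      print genes[i], $1, $3      # Ca ID, KO ID, category
  }'
}

printf '%s\n' \
  'K00001 Ca_19344,Ca_19730 01100Metabolicpathways(812)' \
  'K04650 Ca_20380,Ca_04277 01522Endocrineresistance(7)' | ca_rows
```

Counting from 1 to n keeps the Ca IDs in the order they appear in the input line.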
Answer 2 (score: 0)
Short awk solution:
awk '/^[0-9]/{ cat=$0;next }cat{ printf "%s %s",$0,(/^Ca/)? cat"\n":"" }' file
Output:
K00001 Ca_19344,Ca_19730 01100Metabolicpathways(812)
K00002 Ca_09433,Ca_23715,Ca_15858,Ca_19929,Ca_26670 01100Metabolicpathways(812)
K00008 Ca_20904 01100Metabolicpathways(812)
K00011 Ca_15431 01100Metabolicpathways(812)
K00012 Ca_10466,Ca_23867,Ca_06574 01100Metabolicpathways(812)
K00013 Ca_08009 01100Metabolicpathways(812)
K00016 Ca_02357,Ca_16304 01100Metabolicpathways(812)
K00020 Ca_08005 01100Metabolicpathways(812)
K00021 Ca_10251,Ca_09868 01100Metabolicpathways(812)
K00297 Ca_26773 01523Antifolateresistance(7)
K00600 Ca_00054,Ca_00455,Ca_14951,Ca_11397,Ca_08538,Ca_11540,Ca_11173 01523Antifolateresistance(7)
K04650 Ca_20380,Ca_04277 01522Endocrineresistance(7)
Bonus solution for the new condition (part two):
awk '/^[0-9]/{ cat=$0;next }cat{ if(/^K/){ k=$0;next }
 if(/^Ca/){ n=split($0,a,","); for(i=1;i<=n;i++) print a[i],k,cat } }' file
Output:
Ca_19344 K00001 01100Metabolicpathways(812)
Ca_19730 K00001 01100Metabolicpathways(812)
Ca_09433 K00002 01100Metabolicpathways(812)
Ca_23715 K00002 01100Metabolicpathways(812)
Ca_15858 K00002 01100Metabolicpathways(812)
Ca_19929 K00002 01100Metabolicpathways(812)
Ca_26670 K00002 01100Metabolicpathways(812)
Ca_20904 K00008 01100Metabolicpathways(812)
Ca_15431 K00011 01100Metabolicpathways(812)
Ca_10466 K00012 01100Metabolicpathways(812)
Ca_23867 K00012 01100Metabolicpathways(812)
Ca_06574 K00012 01100Metabolicpathways(812)
Ca_08009 K00013 01100Metabolicpathways(812)
Ca_02357 K00016 01100Metabolicpathways(812)
Ca_16304 K00016 01100Metabolicpathways(812)
Ca_08005 K00020 01100Metabolicpathways(812)
Ca_10251 K00021 01100Metabolicpathways(812)
Ca_09868 K00021 01100Metabolicpathways(812)
Ca_26773 K00297 01523Antifolateresistance(7)
Ca_00054 K00600 01523Antifolateresistance(7)
Ca_00455 K00600 01523Antifolateresistance(7)
Ca_14951 K00600 01523Antifolateresistance(7)
Ca_11397 K00600 01523Antifolateresistance(7)
Ca_08538 K00600 01523Antifolateresistance(7)
Ca_11540 K00600 01523Antifolateresistance(7)
Ca_11173 K00600 01523Antifolateresistance(7)
Ca_20380 K04650 01522Endocrineresistance(7)
Ca_04277 K04650 01522Endocrineresistance(7)
Answer 3 (score: 0)
$ cat tst.awk
BEGIN { OFS="\t" }
/\(/ { cat=$0; cnt=0; next }
++cnt % 2 { kid=$0; next }
{ print kid, $0, cat }
$ awk -f tst.awk file
K00001 Ca_19344,Ca_19730 01100Metabolicpathways(812)
K00002 Ca_09433,Ca_23715,Ca_15858,Ca_19929,Ca_26670 01100Metabolicpathways(812)
K00008 Ca_20904 01100Metabolicpathways(812)
K00011 Ca_15431 01100Metabolicpathways(812)
K00012 Ca_10466,Ca_23867,Ca_06574 01100Metabolicpathways(812)
K00013 Ca_08009 01100Metabolicpathways(812)
K00016 Ca_02357,Ca_16304 01100Metabolicpathways(812)
K00020 Ca_08005 01100Metabolicpathways(812)
K00021 Ca_10251,Ca_09868 01100Metabolicpathways(812)
K00297 Ca_26773 01523Antifolateresistance(7)
K00600 Ca_00054,Ca_00455,Ca_14951,Ca_11397,Ca_08538,Ca_11540,Ca_11173 01523Antifolateresistance(7)
K04650 Ca_20380,Ca_04277 01522Endocrineresistance(7)
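tst.awk treats any line containing a parenthesis as a category and then uses a counter to alternate between KO ID lines and gene-list lines. The same positional idea could also cover part two of the question. What follows is only a sketch under that assumption, not something the script above does; the function name ca_by_position and the array name g are made up for illustration.

```shell
# ca_by_position: hypothetical helper; same positional logic as tst.awk,
# but prints one tab-separated row per Ca ID
ca_by_position() {
  awk '
    BEGIN     { OFS = "\t" }
    /\(/      { cat = $0; cnt = 0; next }   # category line resets the counter
    ++cnt % 2 { kid = $0; next }            # odd lines after a category: KO ID
              { n = split($0, g, ",")       # even lines: comma-separated Ca IDs
                for (i = 1; i <= n; i++) print g[i], kid, cat }
  '
}

printf '%s\n' \
  '01100Metabolicpathways(812)' \
  'K00001' \
  'Ca_19344,Ca_19730' \
  '01522Endocrineresistance(7)' \
  'K04650' \
  'Ca_20380,Ca_04277' | ca_by_position
```

Because it depends on strict alternation of KO ID and gene-list lines, a blank or missing line inside a category would shift the pairing.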