Question

I am trying to use awk and sed to extract some information from my file, but I am not sure how to make it work.

Here is my data:

00020dfa-549d-43e4-877d-d3dcbc212fe5    Pleosporales_sp|HE820879|SH1523966.08FU|reps|k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__unidentified;g__unidentified;s__Pleosporales_sp   90.099  707 1680    1195    39  24

The expected output looks like this:

00020dfa-549d-43e4-877d-d3dcbc212fe5    k__Fungi;   p__Ascomycota;  c__Dothideomycetes; o__Pleosporales;    f__unidentified;    g__unidentified;    s__Pleosporales_sp

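So basically I want only the first two columns, and from the second column only the part that starts at k__, with each ";"-terminated item separated from the next.

I tried the following code: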

awk -F"\t" '{print $1, $2}' infile.tab |
    sed -e '|' -e '|' -e '|' -e '|' -e 'D' > outfile.tab

But I cannot get the expected output. Any suggestions would be much appreciated!

Answer 1

With awk:

$ awk '{gsub(/.*\|/,"",$2);   # remove everything up to the last pipe from $2
        gsub(/;/,";\t",$2);   # add a tab after each semicolon in $2
        print $1 "\t" $2}' file

00020dfa-549d-43e4-877d-d3dcbc212fe5    k__Fungi;       p__Ascomycota;       c__Dothideomycetes;  \
o__Pleosporales;        f__unidentified;     g__unidentified;        s__Pleosporales_sp
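As a small check of the first step, the sketch below runs only the pipe-stripping substitution on a shortened copy of the sample record. The shortened taxonomy string and the tab-separated layout are assumptions made for illustration. Because `.*` is greedy, `/.*\|/` matches through the last pipe in the field, so a single sub() is enough here.

```shell
# Shortened sample record; the tab separators are an assumption about the real file.
line=$(printf '%s\t%s\t%s\t%s' \
  '00020dfa-549d-43e4-877d-d3dcbc212fe5' \
  'Pleosporales_sp|HE820879|SH1523966.08FU|reps|k__Fungi;p__Ascomycota;s__Pleosporales_sp' \
  '90.099' '707')

# Greedy .* makes /.*\|/ match through the LAST pipe in $2.
taxonomy=$(printf '%s\n' "$line" | awk '{ sub(/.*\|/, "", $2); print $2 }')

printf '%s\n' "$taxonomy"   # prints: k__Fungi;p__Ascomycota;s__Pleosporales_sp
```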

Answer 2

Perhaps a slightly simpler awk, though otherwise not much different from @karakfa's solution:

awk '
  BEGIN {
    FS = OFS = "\t"
  }
  {
    sub(/.*\|/, "", $2)
    gsub(/;/, ";\t", $2)
    print $1, $2
  }
  ' infile.tab > outfile.tab

Output:

00020dfa-549d-43e4-877d-d3dcbc212fe5    k__Fungi;       p__Ascomycota;  c__Dothideomycetes;     o__Pleosporales;        f__unidentified;        g__unidentified;      s__Pleosporales_sp
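One way to confirm that the result really is tab-separated is to count its columns. The sketch below feeds the sample record through the same awk program and counts the tab-delimited fields. It assumes the input columns are separated by real tabs (the post renders them as spaces), and only the first four input columns are rebuilt for brevity.

```shell
# Sample record rebuilt with literal tabs (an assumption about the real file).
record=$(printf '%s\t%s\t%s\t%s' \
  '00020dfa-549d-43e4-877d-d3dcbc212fe5' \
  'Pleosporales_sp|HE820879|SH1523966.08FU|reps|k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__unidentified;g__unidentified;s__Pleosporales_sp' \
  '90.099' '707')

# Same program as in the answer, written on one line.
result=$(printf '%s\n' "$record" |
  awk 'BEGIN { FS = OFS = "\t" } { sub(/.*\|/, "", $2); gsub(/;/, ";\t", $2); print $1, $2 }')

# Count tab-separated columns: the ID plus seven taxonomic ranks.
ncols=$(printf '%s\n' "$result" | awk -F'\t' '{ print NF }')

printf '%s\n' "$ncols"   # prints: 8
```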

Answer 3

A sed solution (if you are using GNU sed, you can replace every $(printf '\t') with \t):

sed -E "s/([^[:blank:]]+[[:blank:]]+[^[:blank:]]+[[:blank:]]+).*/\1/;s/[^[:blank:]]*\|//;s/;/;$(printf '\t')/g;s/[[:blank:]]+/$(printf '\t')/;s/[[:blank:]]+$//" infile.tab > outfile.tab

Output:

00020dfa-549d-43e4-877d-d3dcbc212fe5 k__Fungi; p__Ascomycota; c__Dothideomycetes; o__Pleosporales; f__unidentified; g__unidentified; s__Pleosporales_sp

Explanation:

s/([^[:blank:]]+[[:blank:]]+[^[:blank:]]+[[:blank:]]+).*/\1/ keeps only the first two fields.
s/[^[:blank:]]*\|// removes everything in the second field up to k__Fungi.
s/;/;$(printf '\t')/g; adds a tab after each ;
s/[[:blank:]]+/$(printf '\t')/ replaces the blanks separating the first and second fields with a single tab (this can be omitted if the two fields are already tab-separated).
s/[[:blank:]]+$// removes trailing blanks.
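The same substitutions may be easier to follow with one -e expression per step and the tab held in a variable. This is only a sketch of the command from this answer, run against the sample record; the variable names and the tab-separated input are assumptions, and only the first four input columns are rebuilt.

```shell
tab=$(printf '\t')   # literal tab, so the script does not rely on GNU sed's \t

# Sample record rebuilt with literal tabs (an assumption about the real file).
record=$(printf '%s\t%s\t%s\t%s' \
  '00020dfa-549d-43e4-877d-d3dcbc212fe5' \
  'Pleosporales_sp|HE820879|SH1523966.08FU|reps|k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__unidentified;g__unidentified;s__Pleosporales_sp' \
  '90.099' '707')

# Steps, in order:
#   1. keep only the first two fields
#   2. drop everything in the second field through its last pipe
#   3. add a tab after every semicolon
#   4. turn the first run of blanks (between fields 1 and 2) into one tab
#   5. strip trailing blanks
result=$(printf '%s\n' "$record" | sed -E \
  -e 's/([^[:blank:]]+[[:blank:]]+[^[:blank:]]+[[:blank:]]+).*/\1/' \
  -e 's/[^[:blank:]]*\|//' \
  -e "s/;/;${tab}/g" \
  -e "s/[[:blank:]]+/${tab}/" \
  -e 's/[[:blank:]]+$//')

printf '%s\n' "$result"
```

Keeping the patterns in single quotes and only the tab-bearing expressions in double quotes limits what the shell expands inside the sed script.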

How do I extract the first two columns and then remove part of the information in the second column?
