Join rows that have the same value in the first column

Asked: 2013-11-06 22:14:38

Tags: bash awk

I have a tab-delimited file with three columns (excerpt):

AC147602.5_FG004    IPR000146   Fructose-1,6-bisphosphatase class 1/Sedoheputulose-1,7-bisphosphatase
AC147602.5_FG004    IPR023079   Sedoheptulose-1,7-bisphosphatase
AC148152.3_FG001    IPR002110   Ankyrin repeat
AC148152.3_FG001    IPR026961   PGG domain

Using bash, I would like to get this:

AC147602.5_FG004 IPR000146 Fructose-1,6-bisphosphatase class 1/Sedoheputulose-1,7-bisphosphatase IPR023079 Sedoheptulose-1,7-bisphosphatase
AC148152.3_FG001 IPR002110 Ankyrin repeat IPR026961 PGG domain

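So if the ID in the first column is the same on several rows, the output should contain a single line for that ID, with the remaining columns of all those rows joined onto it. For the example above, this gives a two-line file.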

4 answers:

Answer 0 (score: 7)

Try this one-liner:

 awk -F'\t' -v OFS='\t' '{x=$1;$1="";a[x]=a[x]$0}END{for(x in a)print x a[x]}' file
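
The one-liner above walks the array with `for (x in a)`, and awk does not specify the order of that loop, so the output lines may not follow the input order. Below is a minimal sketch of an order-preserving variant. The function name `join_by_first_column` and the arrays `merged` and `order` are made up for illustration, and the sample rows use shortened placeholder values rather than the real data.

```shell
#!/usr/bin/env bash
# Hypothetical helper: merge rows that share the first tab-separated column,
# keeping keys in the order in which they first appear.
join_by_first_column() {
  awk -F'\t' -v OFS='\t' '
    {
      key = $1
      if (!(key in merged)) {      # first time this key is seen
        order[++n] = key           # remember first-seen order
        merged[key] = ""
      }
      # Append every remaining field, each preceded by a tab.
      for (i = 2; i <= NF; i++) merged[key] = merged[key] OFS $i
    }
    END {
      for (j = 1; j <= n; j++) print order[j] merged[order[j]]
    }
  ' "$@"
}

# Example with placeholder rows; the two id1 rows are not adjacent.
printf 'id1\tIPR1\tfoo\nid2\tIPR3\tbaz\nid1\tIPR2\tbar\n' | join_by_first_column
```

Like the one-liner, this keeps every merged line in memory until the end of the input, so it trades memory for not having to sort.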

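Answer 1 (score: 0)

For whatever reason, the awk solution did not work for me in cygwin, so I used Perl instead. It joins the values with a tab and separates lines with \n:
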
cat FILENAME | perl -e 'foreach $Line (<STDIN>) { next unless @Cols=($Line=~/^\s*(\S+)\s+(.*?)\s*$/); push(@{$Link{$Cols[0]}}, $Cols[1]); } foreach $Key (sort keys %Link) { print join("\t", $Key, @{$Link{$Key}})."\n"; }'

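Answer 2 (score: 0)

This depends on the file size (and on awk's limits).

If the file is too large, sorting it first reduces what awk has to hold: only one key (label) at a time is kept in memory for printing.

Classic version, modifying the whole line and printing late: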

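sort YourFile \
 | awk '
      # Same key as the previous line: strip the key, append the rest.
      NR > 1 && Last == $1 { sub( /^[^[:blank:]]*[[:blank:]]+/, ""); C = C " " $0; next}
      # New key: print the finished record before starting the next one.
      NR > 1 { print Last C}
      { Last = $1; sub( /^[^[:blank:]]*[[:blank:]]+/, ""); C = " " $0}
      END { if (NR) print Last C}
      '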

Another version, using fields and printing as it goes, but less "human readable":

sort YourFile \
 | awk '
      # New key: end the previous output line (if any) and print the key.
      Last!=$1 {printf( "%s%s", (NR > 1 ? "\n" : ""), Last=$1)}
      Last==$1 {for( i=2;i<=NF;i++) printf( " %s", $i)}
      END { if (NR) print ""}
      '
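
The sort-first idea can also be written so that the output stays tab-separated and rows with the same ID keep their input order. The following is only a sketch under a few assumptions: the function name `merge_sorted` is hypothetical, and `sort -s` (stable sort) is available in GNU and BSD `sort` but is not required by POSIX.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sort on the first column only, then stream through awk,
# holding just the current output line in memory.
merge_sorted() {
  # -t sets the field separator to a tab, -k1,1 sorts on column 1 only,
  # -s keeps rows with equal keys in their original order.
  sort -t "$(printf '\t')" -k1,1 -s "$@" |
    awk -F'\t' -v OFS='\t' '
      # First line or key change: flush the finished line, start a new one.
      NR == 1 || $1 != prev { if (NR > 1) print line; line = $1; prev = $1 }
      # Append the remaining fields of every row to the current line.
      { for (i = 2; i <= NF; i++) line = line OFS $i }
      END { if (NR) print line }
    '
}

# Usage (YourFile is the placeholder file name used in this answer):
#   merge_sorted YourFile > merged.tsv
```

Because awk only ever holds one output line, memory use is bounded by the longest merged record rather than by the whole file; the cost is moved into `sort`, which can spill to disk.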

Answer 3 (score: 0)

Pure bash version. It has no extra dependencies, but it needs bash 4.0 or later (2009) for associative array support.

As a one-liner:

{ declare -A merged; merged=(); while IFS=$'\t' read -r key value; do merged[$key]="${merged[$key]}"$'\t'"$value"; done; for key in "${!merged[@]}"; do echo "$key${merged[$key]}"; done } < INPUT_FILE.tsv

Readable, commented equivalent:

{
  # Define `merged` as an empty associative array.
  declare -A merged
  merged=()

  # Read tab-separated lines. Any leftover fields also end up in `value`.
  while IFS=$'\t' read -r key value
  do
    # Append to any value that's already there, separated by a tab.
    merged[$key]="${merged[$key]}"$'\t'"$value"
  done

  # Loop over the input keys. Note that the order is arbitrary;
  # pipe through `sort` if you want a predictable order.
  for key in "${!merged[@]}"
  do
    # Each value is prefixed with a tab, so no need for a tab here.
    echo "$key${merged[$key]}"
  done
} < INPUT_FILE.tsv
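
As the comment in the loop above says, the key order is arbitrary. A minimal sketch of the same idea wrapped in a function that reads standard input and pipes its result through `sort` is shown below. The name `merge_rows` is made up for illustration; skipping lines with an empty key and accepting a final line without a trailing newline are extra assumptions about the input, not part of the version above.

```shell
#!/usr/bin/env bash
# Hypothetical helper; requires bash 4.0+ for associative arrays.
merge_rows() {
  local -A merged=()
  local key value
  # `|| [[ -n $key ]]` also processes a last line that lacks a newline.
  while IFS=$'\t' read -r key value || [[ -n $key ]]; do
    [[ -z $key ]] && continue          # an empty key is not a valid subscript
    merged[$key]+=$'\t'$value          # append, separated by a tab
  done
  for key in "${!merged[@]}"; do
    # printf instead of echo, so keys such as "-n" are printed literally.
    printf '%s%s\n' "$key" "${merged[$key]}"
  done | sort                          # make the output order predictable
}

# Usage:
#   merge_rows < INPUT_FILE.tsv
```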