我有一个带有三列的制表符分隔文件(摘录):
AC147602.5_FG004 IPR000146 Fructose-1,6-bisphosphatase class 1/Sedoheputulose-1,7-bisphosphatase
AC147602.5_FG004 IPR023079 Sedoheptulose-1,7-bisphosphatase
AC148152.3_FG001 IPR002110 Ankyrin repeat
AC148152.3_FG001 IPR026961 PGG domain
我想用bash来解决这个问题:
AC147602.5_FG004 IPR000146 Fructose-1,6-bisphosphatase class 1/Sedoheputulose-1,7-bisphosphatase IPR023079 Sedoheptulose-1,7-bisphosphatase
AC148152.3_FG001 IPR023079 Sedoheptulose-1,7-bisphosphatase IPR002110 Ankyrin repeat IPR026961 PGG domain
因此,如果第一列中的ID在多行中相同,则应为每个ID生成一行,并且所有其他行连接。在示例中,它将提供两行文件。
答案 0 :(得分:7)
试试这个单行:
awk -F'\t' -v OFS='\t' '{x=$1;$1="";a[x]=a[x]$0}END{for(x in a)print x,a[x]}' file
答案 1 :(得分:0)
无论出于何种原因,awk解决方案在cygwin中对我不起作用。所以我改用了Perl。它连接一个制表符并用\ n
分隔行cat FILENAME | perl -e 'foreach $Line (<STDIN>) { @Cols=($Line=~/^\s*(\d+)\s*(.*?)\s*$/); push(@{$Link{$Cols[0]}}, $Cols[1]); } foreach $List (values %Link) { print join("\t", @{$List})."\n"; }'
答案 2 :(得分:0)
将取决于文件大小(和awk限制)
如果太大,这将通过先排序文件来减少awk需求,并且只在内存中保留1 标签进行打印
使用整行修改的后期打印的经典版本
sort YourFile \
| awk '
last==$1 { sub( /^[^[:blank:]]*[[:blank:]]+/, ""); C = C " " $0; next}
NR > 1 { print Last C; Last = $1; C = ""}
END { print Last}
'
另一个版本使用字段和预打印,但较少“人类可读”
sort YourFile \
| awk '
last!=$1 {printf( "%s%s", (! NR ? "\n" : ""), Last=$1)}
last==$1 {for( i=2;i<NF;i++) printf( " %s", $i)}
'
答案 3 :(得分:0)
纯 bash 版本。它没有额外的依赖项,但需要 bash 4.0 或更高版本 (2009) 才能支持关联数组。
一行:
{ declare -A merged; merged=(); while IFS=$'\t' read -r key value; do merged[$key]="${merged[$key]}"$'\t'"$value"; done; for key in "${!merged[@]}"; do echo "$key${merged[$key]}"; done } < INPUT_FILE.tsv
可读和注释等价物:
{
# Define `merged` as an empty associative array.
declare -A merged
merged=()
# Read tab-separated lines. Any leftover fields also end up in `value`.
while IFS=$'\t' read -r key value
do
# Append to any value that's already there, separated by a tab.
merged[$key]="${merged[$key]}"$'\t'"$value"
done
# Loop over the input keys. Note that the order is arbitrary;
# pipe through `sort` if you want a predictable order.
for key in "${!merged[@]}"
do
# Each value is prefixed with a tab, so no need for a tab here.
echo "$key${merged[$key]}"
done
} < INPUT_FILE.tsv