Question

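I am trying to write a script that takes two columns from each of several files and joins them together horizontally. The problem is that the rows appear in a different order in each file, so the data has to be sorted before joining.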

This is what I have come up with so far:

#!/bin/bash

ls *.txt > list

while read line; do
    awk '{print $2}' "$line" > f1
    awk '{print $8}' "$line" > f2
    paste f1 f2 | sort > "$line".output
done < list

ls *.output > list2

head -n 1 list2 > start

while read line; do
    cat "$line" > output
done < start

tail -n +2 list2 > list3

while read line; do
    paste output "$line" | cat > output
done < list3

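My code is probably not very efficient, but it does what I want, except for the second-to-last line (the paste inside the final loop), which does not join the files correctly. If I type that line at the command line it works fine, but inside the while loop it loses columns.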
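A possible explanation for the missing columns, offered as an assumption rather than something confirmed in the question: in `paste output "$line" | cat > output`, the shell opens `output` for writing, and truncates it, while `paste` is still starting up, so `paste` may read an already-empty file. A minimal sketch that writes to a temporary file first (the names `next.output` and `output.tmp` are made up for illustration):

```shell
#!/bin/bash
# Demo data in a scratch directory; all file names here are illustrative.
workdir=$(mktemp -d)
printf 'a\t1\nb\t2\n' > "$workdir/output"
printf 'a\t3\nb\t4\n' > "$workdir/next.output"

# Write the pasted result to a temporary file, then replace "output",
# so that paste never reads from the file it is writing to.
paste "$workdir/output" "$workdir/next.output" > "$workdir/output.tmp" &&
    mv "$workdir/output.tmp" "$workdir/output"

cat "$workdir/output"
```

Inside the loop from the question, the same two steps would replace the `paste ... | cat > output` line.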

The data files look like this:

bundle_id   target_id   length  eff_length  tot_counts  uniq_counts est_counts  eff_counts  ambig_distr_alpha   ambig_distr_beta    fpkm    fpkm_conf_low   fpkm_conf_high  solvable    tpm
1   comp165370_c0_seq1  297 0.000000    0   0   0.000000    0.000000    0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    F   0.000000e+00
2   comp75418_c0_seq1   1371    852.132325  35  0   0.005490    0.008832    8.287807e-04    5.283100e+00    4.583199e-04    0.000000e+00    2.425095e-02    T   6.225299e-04
3   comp76235_c0_seq1   1371    871.645349  44  9   43.994510   69.198412   2.002884e+00    3.142003e-04    3.590738e+00    3.516301e+00    3.665174e+00    T   4.877251e+00
4   comp31034_c0_seq1   379 251.335522  14  0   7.049180    10.629771   1.000000e+00    1.000000e+00    1.995307e+00    0.000000e+00    5.957982e+00    F   2.710199e+00
5   comp36102_c0_seq1   379 234.689179  14  0   6.950820    11.224893   1.000000e+00    1.000000e+00    2.107017e+00    0.000000e+00    6.350761e+00    F   2.861933e+00
6   comp26522_c0_seq1   220 0.000000    0   0   0.000000    0.000000    0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    F   0.000000e+00
7   comp122428_c0_seq1  624 0.000000    0   0   0.000000    0.000000    0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    0.000000e+00    F   0.000000e+00

I need the target_id and eff_counts columns.

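This isn't the complete problem, but I thought I'd start small. Later I would like the target_id to appear only once, at the beginning. I would also like a header in the new file containing the name of the file that contributed each column.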

target_id             file_1        file_2        file_3
comp26522_c0_seq1     0.000000      [number]      [number]
comp31034_c0_seq1     10.629771     [number]      [number]
comp36102_c0_seq1     11.224893     [number]      [number]
comp75418_c0_seq1     0.008832      [number]      [number]
comp76235_c0_seq1     69.198412     [number]      [number]
comp122428_c0_seq1    0.000000      [number]      [number]
comp165370_c0_seq1    0.000000      [number]      [number]

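Edit: I added more information to the example. The [number] entries are just placeholders; in reality they are numbers similar to those in the rows under file_1. Also, the header "file_1" would be the name of the input file, and the target_id values should be sorted. All files should contain the same target_ids, but in a different order in each file.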

Edit 2: Output

I tested with four files, and the output looks like this:

    comp0_c0_seq1   0.000000
    comp100000_c0_seq1      1.919404
    comp100002_c0_seq1      2.118776
    comp100003_c0_seq1      0.072916
    comp100004_c0_seq1      0.000000
    comp100005_c0_seq1      0.000000
    comp100006_c0_seq1      1.548160
    comp100007_c0_seq1      7.616481
    comp100008_c0_seq1      0.000000
    comp100009_c0_seq1      1.374209

There is an empty column to the left of the first column of data, and only the data from the last file is present.

Thanks for your help!

Update

I solved the problem with the second-to-last line. This is the code I used:

while read line; do
     join output "$line" > output2
     cat output2 > output
done < list3

This is the output:

comp0_c0_seq1      0.000000 0.000000 0.000000 0.000000
comp100000_c0_seq1 1.919404 1.919404 0.000000 1.919404
comp100002_c0_seq1 2.118776 2.118776 2.225852 2.118776
comp100003_c0_seq1 0.072916 0.072916 1.228136 0.072916
comp100004_c0_seq1 0.000000 0.000000 0.000000 0.000000
comp100005_c0_seq1 0.000000 0.000000 1.982851 0.000000
comp100006_c0_seq1 1.548160 1.548160 1.902749 1.548160
comp100007_c0_seq1 7.616481 7.616481 0.000000 7.616481
comp100008_c0_seq1 0.000000 0.000000 0.000000 0.000000
comp100009_c0_seq1 1.374209 1.374209 1.378667 1.374209

Now I just need to figure out how to add a header containing all the file names to the top of the file.
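One way the header could be built, as a rough sketch: loop over the `.output` files in the same glob order that `ls *.output` would list them, strip the suffix, and prepend the resulting line to the joined data. The demo file names (`file_1.output`, `file_2.output`) and `results.txt` are assumptions for illustration.

```shell
#!/bin/bash
# Demo setup; the .output names below are illustrative stand-ins.
workdir=$(mktemp -d)
printf 'comp0_c0_seq1 0.000000 0.000000\n' > "$workdir/output"
: > "$workdir/file_1.output"
: > "$workdir/file_2.output"

# Start the header with the ID column, then append each file name
# with its .output suffix removed.
header="target_id"
for f in "$workdir"/*.output; do
    name=$(basename "$f" .output)
    header="$header $name"
done

# Put the header line in front of the joined data.
{ printf '%s\n' "$header"; cat "$workdir/output"; } > "$workdir/results.txt"
cat "$workdir/results.txt"
```

The header columns only line up with the data if the files were joined in that same order.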

Answer 1

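You could also start from the file name and the columns of interest in long form, as below, and then transpose the result with a solution like this one: Transpose CSV data with awk (pivot transformation)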

find . -name "bundle*.txt" -exec awk 'NR>1 {print FILENAME,$2,$8}' {} \; | sed 's|^\./||' > superbundle.txt

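Explanation
- Find all files named bundle*.txt
- Run an awk statement on each one that prints the file name plus columns 2 and 8 (without the header row)
- Use sed to remove the leading ./ from the file name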
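To make the intermediate long format concrete, here is a small sketch run against two made-up files. The column names other than target_id and eff_counts, the IDs, and the values are all placeholders; only the `find`/`awk`/`sed` steps mirror the command above. The final `sort` is added just to make the demo output order predictable.

```shell
#!/bin/bash
# Two tiny demo files with a header row and 8 columns each (made-up values).
workdir=$(mktemp -d)
printf 'c1 target_id c3 c4 c5 c6 c7 eff_counts\n1 idA x x x x x 0.5\n' > "$workdir/bundle.txt"
printf 'c1 target_id c3 c4 c5 c6 c7 eff_counts\n1 idA x x x x x 1.5\n' > "$workdir/bundle2.txt"

# Run find from inside the directory so FILENAME starts with "./",
# skip each file's header (NR>1), and strip the leading "./".
long=$(cd "$workdir" &&
    find . -name 'bundle*.txt' -exec awk 'NR>1 {print FILENAME, $2, $8}' {} \; |
    sed 's|^\./||' | LC_ALL=C sort)

printf '%s\n' "$long"
```

Each line holds one (file, target_id, value) triple, which is the shape the transpose step expects.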

Now we can take "superbundle.txt" and transpose it using the solution mentioned by jaypal.

$ cat transpose.awk
{
    if(!($1 in filenames)) { filename[++types] = $1 }; filenames[$1]++
    if(!($2 in target_ids)) { target_id[++num] = $2 }; target_ids[$2]++
    map[$1,$2] = $3
}
END {
    printf "%s\t" ,"target_id";
    for(ind=1; ind<=types; ind++) {
        printf "%s%s", sep, filename[ind];
        sep = "\t"
    }
    print "";
    for(target=1; target<=num; target++) {
        printf "%s", target_id[target]
        for(val=1; val<=types; val++) {
            printf "%s%s", sep, map[filename[val], target_id[target]];
        }
        print ""
    }
}

The output below shows only three files because I only created three sample bundle text files.

$ awk -f transpose.awk superbundle.txt | column -t
target_id           bundle.txt  bundle2.txt  bundle3.txt
comp165370_c0_seq1  0.000000    1.000000     0.000000
comp75418_c0_seq1   0.008832    2.008832     1.008832
comp76235_c0_seq1   69.198412   3.198412     2.198412
comp31034_c0_seq1   10.629771   4.629771     3.629771
comp36102_c0_seq1   11.224893   5.224893     4.224893
comp26522_c0_seq1   0.000000    6.000000     4.000000
comp122428_c0_seq1  0.000000    7.000000     4.000000
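The rows above come out in first-seen order rather than sorted by target_id, which the question asked for. A sketch of one way to sort the body while keeping the header on top; the `table` variable is a stand-in for the output of `awk -f transpose.awk superbundle.txt`, and its contents are invented for the demo.

```shell
#!/bin/bash
# Stand-in for the output of: awk -f transpose.awk superbundle.txt
table=$(printf 'target_id\tf1\tf2\ncomp3\t1\t2\ncomp1\t3\t4\ncomp2\t5\t6\n')

# read consumes only the header line; sort then orders the remaining rows.
sorted=$(printf '%s\n' "$table" | {
    IFS= read -r header
    printf '%s\n' "$header"
    LC_ALL=C sort
})

printf '%s\n' "$sorted"
```

With the real data, the same `{ read; printf; sort; }` group could sit between the awk command and `column -t`.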

Answer 2

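After a lot of reading and testing, I finally came up with a script that does exactly what I want.

It may not be the most efficient use of bash, but it works fine.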

ls *.xprs > list

while read line; do
    echo "parsing $line"
    awk '{print $2}' "$line" > f1
    awk '{print $8}' "$line" > f2
    paste f1 f2 | sort | head -n -1 > "$line".output
done < list

ls *.output > list2

head -n 1 list2 > start

while read line; do
    cat "$line" > output
done < start

tail -n +2 list2 > list3

while read line; do
    join output "$line" > output2 2>/dev/null
    cat output2 > output
done < list3
sed '1i Contig_ID' list2 | awk '{printf("%s ", $0)}' | sed -e '$a\' | sed 's/.xprs.output//g' > list4

cat list4 output > results.txt
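The `head -n -1` step appears to rely on the header line sorting last, and a negative line count is a GNU extension. As a hedged alternative, assuming each input file has exactly one header line, the header can be skipped in awk before sorting. The file name `sample.xprs` and its contents are invented for this sketch.

```shell
#!/bin/bash
# Demo input with a header row and 8 columns (values are made up).
workdir=$(mktemp -d)
printf 'b target_id l e t u est eff_counts\n1 comp2 x x x x x 2.5\n2 comp1 x x x x x 0.5\n' > "$workdir/sample.xprs"

# NR>1 skips the header, so nothing has to be trimmed after sorting.
# LC_ALL=C keeps the ordering consistent for a later join run the same way.
awk 'NR>1 {print $2, $8}' "$workdir/sample.xprs" | LC_ALL=C sort > "$workdir/sample.xprs.output"

cat "$workdir/sample.xprs.output"
```

This also replaces the two temporary files f1 and f2 with a single awk call.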

Merging columns horizontally from multiple files
