Using bash

Time: 2017-08-15 12:40:38

Tags: bash

I have a tab-delimited text file with 222 rows and 7752 columns. Here are the first 4 columns and 5 rows as an example:

Individual Var1 Var2 Var3
personA    A    A    A
personA    T    T    T
personB    G    G    G
personB    C    C    C

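I need to merge the two rows that represent each person into a single row. That is, the first value in each person's second row should move into the second column of that person's first row, and the same should happen for every remaining row and column. I would end up with only one row per person (111 rows) and twice the columns (15,504). The first 7 columns and 3 rows would then look like this: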

Individual Var1 Var1 Var2 Var2 Var3 Var3
personA    A    T    A    T    A    T
personB    G    C    G    C    G    C

If anyone could suggest a solution, perhaps using bash, I would be very grateful. I haven't attempted a solution yet because I'm very new to coding.

4 Answers:

Answer 0: (score: 1)

Using awk:

sed 1p file | awk '
    {
        n = split($0, line1)
        getline
        split($0, line2)
        printf "%s", line1[1]
        for (i=2; i<=n; i++)
            printf "\t%s\t%s", line1[i], line2[i]
        printf "\n"
    }
' | column -t
Individual  Var1  Var1  Var2  Var2  Var3  Var3
personA     A     T     A     T     A     T
personB     G     C     G     C     G     C

sed 1p file prints the header twice, so awk sees the header as a pair of lines and can process it exactly like the rest of the data.
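
Note that column -t aligns the output with spaces, so the result is no longer tab-delimited. A minimal sketch of a variant that keeps tabs and also checks that both rows of a pair belong to the same person is shown below. The function name merge_pairs_awk is made up for illustration, and the input is assumed to be strictly tab-delimited with exactly two rows per person.

```shell
#!/usr/bin/env bash
# Sketch only: merge_pairs_awk is a hypothetical helper, not part of the answer above.
# Assumes tab-delimited input with one header row and exactly two rows per person.
merge_pairs_awk() {
    awk 'BEGIN { FS = OFS = "\t" }
    NR == 1 {                          # header: print each variable name twice
        out = $1
        for (i = 2; i <= NF; i++) out = out OFS $i OFS $i
        print out
        next
    }
    {
        n = split($0, first, FS)       # remember the first row of the pair
        if ((getline) <= 0) {          # load the second row into $0
            print "error: odd number of data rows" > "/dev/stderr"
            exit 1
        }
        if ($1 != first[1] || NF != n) {
            print "error: line " NR " does not pair with " first[1] > "/dev/stderr"
            exit 1
        }
        out = first[1]
        for (i = 2; i <= n; i++) out = out OFS first[i] OFS $i
        print out
    }' "$1"
}

# Run on a file when one is given on the command line.
if (( $# > 0 )); then
    merge_pairs_awk "$1"
fi
```

Because the header is handled in its own rule, the sed 1p step is not needed in this variant.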

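Answer 1: (score: 0)

Whatever language you use, the logic is the same. Bash is not the fastest tool for this kind of job, but it will work fine. You just need to read the lines in the order header, line1, line2 for each person, verify that both lines belong to the same person and that the number of fields matches, and then simply interleave the two lines into one.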

You can do something like the following:
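
The script itself is not shown here. As a rough sketch only, the logic described above could be written in bash along these lines. The function name combine_lines is made up for illustration, fields are assumed to contain no embedded whitespace, and a single tab is written between output fields.

```shell
#!/usr/bin/env bash
# Sketch only: combine_lines is a hypothetical name, not the answer's original script.
# Assumes fields are separated by tabs or spaces and contain no embedded whitespace.
combine_lines() {
    local file=$1 i
    local -a h a b
    {
        # Header: keep the first column name, print every other name twice.
        IFS=$' \t' read -r -a h || { echo "error: empty input" >&2; return 1; }
        printf '%s' "${h[0]}"
        for ((i = 1; i < ${#h[@]}; i++)); do
            printf '\t%s\t%s' "${h[i]}" "${h[i]}"
        done
        printf '\n'

        # Data: read line1, then line2, for each person.
        while IFS=$' \t' read -r -a a; do
            (( ${#a[@]} )) || continue            # skip blank lines
            b=()
            IFS=$' \t' read -r -a b || true       # tolerate a missing final newline
            if (( ${#b[@]} == 0 )); then
                echo "error: no second line for ${a[0]}" >&2
                return 1
            fi
            # Verify same person and same number of fields.
            if [[ ${a[0]} != "${b[0]}" ]] || (( ${#a[@]} != ${#b[@]} )); then
                echo "error: lines for ${a[0]} and ${b[0]} do not match" >&2
                return 1
            fi
            # Interleave the two lines into one.
            printf '%s' "${a[0]}"
            for ((i = 1; i < ${#a[@]}; i++)); do
                printf '\t%s\t%s' "${a[i]}" "${b[i]}"
            done
            printf '\n'
        done
    } < "$file"
}

# Run on a file when one is given on the command line.
if (( $# > 0 )); then
    combine_lines "$1"
fi
```

Saved as cmblines.sh, this would be run as bash cmblines.sh data.txt. Splitting on whitespace collapses empty fields, so a file with empty cells would need IFS=$'\t' handling adapted to it.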

Note: the additional \t after printing personX is only there for spacing. You should check the length of the person name and adjust as needed. You can change the spacing to use tabs or spaces as required.

Example Input

$ cat data.txt
Individual Var1 Var2 Var3
personA A A A
personA T T T
personB G G G
personB C C C

Example Use/Output

$ bash cmblines.sh data.txt
Individual      Var1    Var1    Var2    Var2    Var3    Var3
personA         A       T       A       T       A       T
personB         G       C       G       C       G       C

Answer 2: (score: 0)

I hope this helps. I read everything into an array and then loop over it.

readarray huge_array < <(tail -n +2 test.txt)

i=0;
while [ $i -lt ${#huge_array[@]} ] 
do
     name_of_person="$( echo ${huge_array[i]} | awk '{print $1}' )"
     rows_person_one="$( echo ${huge_array[i]} |  awk 'BEGIN { ORS=" " };{for (i=2; i<=NF; i++) print $i}')"
     rows_person_one_second_line="$( echo ${huge_array[i+1]} | awk 'BEGIN { ORS=" " };{for (i=2; i<=NF; i++) print $i}')"
     echo "$name_of_person $rows_person_one $rows_person_one_second_line"
     ((i+=2))

 done

test.txt contains:

$ cat> test.txt
Individual Var1 Var2 Var3
personA    A    A    A
personA    T    T    T
personB    G    G    G
personB    C    C    C

The output is:

personA A A A  T T T
personB G G G  C C C

By the way, you can change the line that prints to the following:

 printf '%s \t %s \t %s \n' "$name_of_person" "$rows_person_one" "$rows_person_one_second_line"

if what you want is tab-separated output.

Regards!

Edit

On systems where "readarray" is not available, you can use:

while IFS= read -r line 
do
     huge_array+=("$line") 
done < <(tail -n +2 test.txt)

i=0;
while [ $i -lt ${#huge_array[@]} ] 
do
    name_of_person="$( echo ${huge_array[i]} | awk '{print $1}' )"
    rows_person_one="$( echo ${huge_array[i]} |  awk 'BEGIN { ORS=" " };{for (i=2; i<=NF; i++) print $i}')"
    rows_person_one_second_line="$( echo ${huge_array[i+1]} | awk 'BEGIN { ORS=" " };{for (i=2; i<=NF; i++) print $i}')"
    printf '%s \t %s \t %s \n' "$name_of_person" "$rows_person_one" "$rows_person_one_second_line"
    ((i+=2))

done

Answer 3: (score: 0)

This allows for any number of persons (personN) and variables (varN).

$ /tmp/x.sh
Individual,Var1,Var2,Var3,Var1,Var2,Var3
personA,A,A,A,T,T,T
personB,G,G,G,C,C,C

$ cat /tmp/x.sh
#!/bin/sh

input=/tmp/input.txt

awk '
  NR==1{
    header=$0;
    $1="";
    vars=$0
  }
  NR>1 {
    person=$1;
    if(count[person]){$1=""};
    array[person]=sprintf("%s %s",array[person],$0);
    count[person]++
  }
  END{
    for(i in count){max=(max>count[i])?max:count[i]}
    for(i=2;i<=max;i++){header=sprintf("%s %s",header,vars)}
    print header;
    for (i in array){print array[i]}
  }' $input \
| sed 's/^ *//'\
| tr -s ' ' ','  # ',' because i can see ',' and not tabs on screen
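
One caveat with this approach: awk's for (i in array) visits keys in an unspecified order, so people may not be printed in the order they appear in the input. A minimal sketch that records first-seen order and keeps tab-delimited output follows. The function name group_rows is made up for illustration, and the input is assumed to be strictly tab-delimited; the column layout is the same as in the output above (all values of the first row, then all values of the next row).

```shell
#!/usr/bin/env bash
# Sketch only: group_rows is a hypothetical helper, not part of the answer above.
# Groups any number of rows per person and prints people in first-seen order.
group_rows() {
    awk 'BEGIN { FS = OFS = "\t" }
    NR == 1 {
        id = $1
        vars = substr($0, length($1) + 2)     # header without the ID column
        next
    }
    {
        person = $1
        if (!(person in count)) order[++n] = person   # remember first appearance
        rows[person] = rows[person] OFS substr($0, length($1) + 2)
        if (++count[person] > max) max = count[person]
    }
    END {
        header = id
        for (i = 1; i <= max; i++) header = header OFS vars
        print header
        for (i = 1; i <= n; i++) print order[i] rows[order[i]]
    }' "$1"
}

# Run on a file when one is given on the command line.
if (( $# > 0 )); then
    group_rows "$1"
fi
```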