for multiple files

Time: 2017-11-07 12:05:17

Tags: regex bash shell text-extraction text-manipulation

I'm new to programming, so I may need every step explained. Here is my problem:

Say I have these (tab-delimited) files:

genelist.txt contains:

    start_position  end_position  description
    1               840           putative replication protein
    1839            2030          hypothetical protein
    2095            2328          hypothetical protein
    3076            4020          transposase
    4209            4322          hypothetical protein

a.txt contains:

    NA1.fa
    NA1:0-840      scaffold40|size16362  100.000
    NA1:1838-2030  scaffold40|size16362  100.000
    NA1:3075-4020  scaffold40|size16362  100.000
    NA1:4208-4322  scaffold40|size16362  92.105

b.txt contains:

    NA4.fa
    NA4:1838-2030  scaffold11|size142511  84.707
    NA4:2094-2328  scaffold11|size142511  84.599
    NA4:3075-4020  scaffold11|size142511  84.707

The output I want is:

    start_position  end_position  description                   NA1     NA4
    1               840           putative replication protein  100     -
    1839            2030          hypothetical protein          100     84.707
    2095            2328          hypothetical protein          -       84.599
    3076            4020          transposase                   100     84.707
    4209            4322          hypothetical protein          92.105  -

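Basically, I want to match genes by their end position and print the match percentage (the third field) side by side under the corresponding ID, so that I end up with a comparison table of their percent identity. If there is no match, print '-' or '0' so I can tell which genes matched and which did not.

I'm open to bash / regex / perl / python or any kind of script that gets this done. I apologize if this has been asked before, but I haven't found a solution so far. I hope my question is clear.

Thanks in advance!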

1 Answer:

Answer 0: (score: 0)

That was a challenge. Here is the code:

#!/bin/bash
#
# Process genelist file
#
################################################################################

usage()
{
    echo "process.bash <GENELIST> <DATAFILE1> [<DATAFILE n>]"
    echo "Requires at least the genelist and 1 data file."
    exit 1
}

# Process arguments
if [ $# -lt 2 ]
then
    usage
else
    genelistfile=$1
    # Remove the first argument from $*
    shift
    datafiles=$*
fi

# Setup the output file ########################################################
# Timestamp as year-month-day, then hour-minute-second (%m = month, %M = minute)
processdate=$(date +%Y%m%d-%H%M%S)
outputfile="process_$processdate.out"

# Build the header:
#   the first line of the genelist.txt
#   and the first line of each datafile (processed)
header="start_position\tend_position\tdescription"
for datafile in $datafiles
do
    # Take only the first line (e.g. "NA1.fa") and keep the part before the dot
    datafileheader=$(head -n 1 "$datafile" | cut -d'.' -f1)
    header="$header\t$datafileheader"
done
echo -e "$header" > "$outputfile"

# Process the genelistfile #####################################################

# Read each line from the genelistfile
while read -r line
do
    # Do nothing with the header line
    if [ $(echo $line | grep -c start_position) -gt 0 ]
    then
        continue
    fi

    # Setup the output line, which is the line from genelistfile
    # The program will add values from the datafiles as they are processed
    outputline=$line

    # Extract the second field in the line, endposition
    endposition=$(echo $line | awk '{print $2}')

    # loop on each file in argument
    for datafile in $datafiles
    do
        foundsomething='false'

        # for each line in the datafile...
        while read -r line2
        do
            # If the line is a range line, process it
            if [ $(echo $line2 | grep -c ":") -gt 0 ]
            then
                # Extract the range
                startrange=$(echo $line2 | awk '{print $1}' | cut -d':' -f2 | cut -d'-' -f1)
                endrange=$(echo $line2 | awk '{print $1}' | cut -d':' -f2 | cut -d'-' -f2)
                #echo "range= $startrange --> $endrange"

                # Verify if endposition fits within the range...
                if [ $endposition -ge $startrange -a $endposition -le $endrange ]
                then
                    percentage=$(echo $line2 | awk '{print $3}')
                    outputline="$outputline\t$percentage"
                    foundsomething='true'
                fi
            fi
        done < $datafile

        # When done processing the file, we must check if something was found
        if [ $foundsomething == 'false' ]
        then
            outputline="$outputline\t-"
        fi
    done

    # When done processing that line from genelist, output it
    # (quoted, so the tabs read from the genelist are not collapsed into spaces)
    echo -e "$outputline" >> "$outputfile"

done < $genelistfile

I added plenty of comments to explain what is going on, but I made a few assumptions to keep the code simple:

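  • The first line of every data file has the form SOMETHING1.SOMETHING2. I use SOMETHING1 as the column header.
  • A single file never mixes NA1 and NAx data.
  • Range data is always written as NAx:start-end.
  • The value to extract from a range line is always the third element on that line.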
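
To make these assumptions concrete, here is a minimal sketch of how one range line could be taken apart using only bash built-ins. The function name parse_range_line is hypothetical and is not part of the script above; it relies on the NAx:start-end layout and on the percentage being the third field.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): split a range line
# such as "NA1:1838-2030<TAB>scaffold40|size16362<TAB>100.000" into
# id, start, end and percentage, relying on the assumptions listed above.
parse_range_line() {
    local line=$1
    local range percentage
    # Fields are whitespace separated; the second one is ignored,
    # the third one is the percentage.
    read -r range _ percentage <<< "$line"
    local id=${range%%:*}     # text before the first ':'
    local span=${range#*:}    # text after the first ':'
    local start=${span%%-*}   # text before the '-'
    local end=${span#*-}      # text after the '-'
    printf '%s %s %s %s\n' "$id" "$start" "$end" "$percentage"
}

# Example call with one line from a.txt
parse_range_line $'NA1:1838-2030\tscaffold40|size16362\t100.000'
```

Parameter expansion avoids starting awk and cut several times per line, which matters because the script re-reads every data file once for each gene.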

It works for me with the sample data. Have fun!
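
For larger inputs, a single-pass variant may be worth considering. The sketch below is a different technique from the script above, not a rewrite of it: awk keeps each data file's percentages in memory and looks them up by exact end position, which is what the sample data suggests but is stricter than the range test used in the script. The function name compare_tables is hypothetical, and the files are assumed to be truly tab-delimited.

```shell
#!/bin/bash
# Sketch of a single-pass alternative (assumption: end positions match exactly).
# compare_tables is a hypothetical name; usage: compare_tables GENELIST DATAFILE...
compare_tables() {
    awk -F'\t' -v OFS='\t' '
        # Count files as they start; file 1 is the gene list.
        FNR == 1 { nfile++ }
        nfile == 1 {
            if (FNR == 1) header = $0            # keep the gene list header
            else if (NF) genes[++ngenes] = $0    # remember gene lines in order
            next
        }
        # Data files: the first line names the sample, e.g. "NA1.fa" -> "NA1".
        FNR == 1 { split($1, name, "."); header = header OFS name[1]; next }
        # Range lines look like "NA1:1838-2030"; index the percentage
        # by file number and end position.
        index($1, ":") {
            split($1, idrange, ":")
            split(idrange[2], span, "-")
            pct[nfile, span[2]] = $3
        }
        END {
            print header
            for (g = 1; g <= ngenes; g++) {
                split(genes[g], f, FS)
                line = genes[g]
                # One column per data file; "-" when nothing matched.
                for (d = 2; d <= nfile; d++)
                    line = line OFS (((d, f[2]) in pct) ? pct[d, f[2]] : "-")
                print line
            }
        }
    ' "$@"
}
```

With the sample files it could be called as compare_tables genelist.txt a.txt b.txt > comparison.tsv. Percentages are printed as they appear in the data files (for example 100.000 rather than 100).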