Question

I'm new to programming, so I may need each step explained. Here is my problem:

Say I have these (tab-delimited) files:

genelist.txt contains:

start_position  end_position  description
1               840           putative replication protein
1839            2030          hypothetical protein
2095            2328          hypothetical protein
3076            4020          transposase
4209            4322          hypothetical protein

a.txt contains:

NA1.fa
NA1:0-840      scaffold40|size16362  100.000
NA1:1838-2030  scaffold40|size16362  100.000
NA1:3075-4020  scaffold40|size16362  100.000
NA1:4208-4322  scaffold40|size16362  92.105

b.txt contains:

NA4.fa
NA4:1838-2030  scaffold11|size142511  84.707
NA4:2094-2328  scaffold11|size142511  84.599
NA4:3075-4020  scaffold11|size142511  84.707

The output I want is:

start_position  end_position  description                   NA1     NA4
1               840           putative replication protein  100     -
1839            2030          hypothetical protein          100     84.707
2095            2328          hypothetical protein          -       84.599
3076            4020          transposase                   100     84.707
4209            4322          hypothetical protein          92.105  -

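Basically, I want to match the genes by end position and print the match percentages (the 3rd field) side by side under the corresponding ID, so that I get a comparison table of their percent identity. If there is no match, print '-' or '0' so I know which ones matched and which didn't.

I'm open to bash / regex / perl / python or any kind of script that gets the job done. I apologize if this has been asked before, but I couldn't find any solution so far. I hope my question is clear.

Thanks in advance!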

Answer 1

That was a challenge. Here is the code:

#!/bin/bash
#
# Process genelist file
#
################################################################################

usage()
{
    echo "process.bash <GENELIST> <DATAFILE1> [<DATAFILE n>]"
    echo "Requires at least the genelist and 1 data file."
    exit 1
}

# Process arguments
if [ $# -lt 2 ]
then
    usage
else
    genelistfile=$1
    # Remove the first argument from $*
    shift
    datafiles=$*
fi

# Setup the output file ########################################################
processdate=$(date +%Y%m%d-%H%M%S)
outputfile="process_$processdate.out"

# Build the header:
#   the first line of the genelist.txt
#   and the first line of each datafile (processed)
header="start_position\tend_position\tdescription"
for datafile in $datafiles
do
    # Take the part before the '.' on the first line (e.g. NA1 from NA1.fa)
    datafileheader=$(head -n 1 "$datafile" | cut -d'.' -f1)
    header="$header\t$datafileheader"
done
echo -e "$header" >"$outputfile"

# Process the genelistfile #####################################################

# Read each line from the genelistfile
while read -r line
do
    # Do nothing with the header line
    if [ $(echo $line | grep -c start_position) -gt 0 ]
    then
        continue
    fi

    # Setup the output line, which is the line from genelistfile
    # The program will add values from the datafiles as they are processed
    outputline=$line

    # Extract the second field in the line, endposition
    endposition=$(echo $line | awk '{print $2}')

    # loop on each file in argument
    for datafile in $datafiles
    do
        foundsomething='false'

        # for each line in the datafile...
        while read -r line2
        do
            # If the line is a range line, process it
            if [ $(echo $line2 | grep -c ":") -gt 0 ]
            then
                # Extract the range
                startrange=$(echo $line2 | awk '{print $1}' | cut -d':' -f2 | cut -d'-' -f1)
                endrange=$(echo $line2 | awk '{print $1}' | cut -d':' -f2 | cut -d'-' -f2)
                #echo "range= $startrange --> $endrange"

                # Verify if endposition fits within the range...
                if [ "$endposition" -ge "$startrange" ] && [ "$endposition" -le "$endrange" ]
                then
                    percentage=$(echo $line2 | awk '{print $3}')
                    outputline="$outputline\t$percentage"
                    foundsomething='true'
                fi
            fi
        done < $datafile

        # When done processing the file, we must check if something was found
        if [ $foundsomething == 'false' ]
        then
            outputline="$outputline\t-"
        fi
    done

    # When done processing that line from genelist, output it
    # Quote the variable so the tabs from the genelist line are preserved
    echo -e "$outputline" >>"$outputfile"

done < $genelistfile

I've added a lot of comments to explain what is going on, but I made a few assumptions to simplify the code:

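The first line of every data file has the form SOMETHING1.SOMETHING2. I use SOMETHING1 as the column header.
There is no mix of NA1 and NAx data within the same file.
Range data is always given as NAx:start-end.
The value extracted from a range line is always the 3rd element on the line.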
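
To make the last two assumptions concrete, here is a minimal sketch of how one range line can be taken apart and checked. The function names parse_range_line and lookup_percentage are hypothetical and do not appear in the script above. The sketch uses shell parameter expansion instead of the awk and cut pipeline, and it stops at the first matching range, whereas the script appends every match it finds.

```shell
# Hypothetical helpers for illustration only; not part of the script above.

# Split "NAx:start-end<TAB>scaffold<TAB>percentage" into "id start end percentage".
# The body runs in a subshell ( ... ) so its variables do not leak out.
parse_range_line() (
    # read splits the line on whitespace into its three fields
    read -r id_and_range scaffold percentage <<EOF
$1
EOF
    id=${id_and_range%%:*}      # text before the first ':'  (NA1)
    range=${id_and_range#*:}    # text after the first ':'   (1838-2030)
    start=${range%%-*}          # text before the first '-'  (1838)
    end=${range#*-}             # text after the first '-'   (2030)
    printf '%s %s %s %s\n' "$id" "$start" "$end" "$percentage"
)

# Print the percentage of the first range in the data file (2nd argument)
# that contains the end position (1st argument), or "-" if none does.
lookup_percentage() (
    endposition=$1
    datafile=$2
    while IFS= read -r line; do
        case $line in
            *:*) ;;           # range line: go on to the check below
            *) continue ;;    # header line such as "NA1.fa": skip it
        esac
        # Reuse the positional parameters: $2 = start, $3 = end, $4 = percentage
        set -- $(parse_range_line "$line")
        if [ "$endposition" -ge "$2" ] && [ "$endposition" -le "$3" ]; then
            printf '%s\n' "$4"
            exit 0            # leaves only this function's subshell
        fi
    done < "$datafile"
    printf '%s\n' "-"
)

# Small demo with two lines taken from a.txt
demofile=$(mktemp)
printf 'NA1.fa\nNA1:0-840\tscaffold40|size16362\t100.000\nNA1:1838-2030\tscaffold40|size16362\t100.000\n' > "$demofile"
lookup_percentage 2030 "$demofile"   # prints 100.000
lookup_percentage 2328 "$demofile"   # prints -
rm -f "$demofile"
```

If a data file ever broke one of the assumptions (for example a range written without the ':' separator), these helpers would skip or misread the line, just as the main script would.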

It works for me with your sample data. Have fun!
