I am new to programming, so I may need each step explained. Here is my question:
Say I have these (tab-delimited) files:
start_position end_position description
1 840 putative replication protein
1839 2030 hypothetical protein
2095 2328 hypothetical protein
3076 4020 transposase
4209 4322 hypothetical protein
NA1.fa
NA1:0-840 scaffold40|size16362 100.000
NA1:1838-2030 scaffold40|size16362 100.000
NA1:3075-4020 scaffold40|size16362 100.000
NA1:4208-4322 scaffold40|size16362 92.105
NA4.fa
NA4:1838-2030 scaffold11|size142511 84.707
NA4:2094-2328 scaffold11|size142511 84.599
NA4:3075-4020 scaffold11|size142511 84.707
The output I want is:
start_position end_position description NA1 NA4
1 840 putative replication protein 100 -
1839 2030 hypothetical protein 100 84.707
2095 2328 hypothetical protein - 84.599
3076 4020 transposase 100 84.707
4209 4322 hypothetical protein 92.105 -
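Basically, I want to match the genes by their end position and print the match percentage (the 3rd field) side by side under the corresponding ID, so that I end up with a comparison table of their percent identity. If there is no match, print '-' or '0' so I know which ones matched and which did not.
I am open to bash, regex, perl, python or any kind of script that gets the job done. Apologies if this has been asked before, but I could not find a solution so far. I hope my question is clear.
Thanks in advance!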
Answer 0: (score: 0)
That was a challenge. So here is the code:
#!/bin/bash
#
# Process genelist file
#
################################################################################
usage()
{
    echo "process.bash <GENELIST> <DATAFILE1> [<DATAFILE n>]"
    echo "Requires at least the genelist and 1 data file."
    exit 1
}

# Process arguments
if [ $# -lt 2 ]
then
    usage
else
    genelistfile=$1
    # Remove the first argument from the argument list
    shift
    # Keep the remaining arguments in an array so file names with spaces survive
    datafiles=("$@")
fi

# Setup the output file ########################################################
# In date formats %m is the month and %M the minutes
processdate=$(date +%Y%m%d-%H%M%S)
outputfile="process_$processdate.out"

# Build the header:
#   the first line of the genelist.txt
#   and the first line of each datafile (processed)
header="start_position\tend_position\tdescription"
for datafile in "${datafiles[@]}"
do
    # The name line is the one without ":"; drop the extension (NA1.fa -> NA1)
    datafileheader=$(grep -v ":" "$datafile" | head -n 1 | cut -d'.' -f1)
    header="$header\t$datafileheader"
done
echo -e "$header" >"$outputfile"

# Process the genelistfile #####################################################
# Read each line from the genelistfile
while read -r line
do
    # Do nothing with the header line
    if [ "$(echo "$line" | grep -c start_position)" -gt 0 ]
    then
        continue
    fi

    # Setup the output line, which is the line from genelistfile
    # The program will add values from the datafiles as they are processed
    outputline=$line

    # Extract the second field in the line, endposition
    endposition=$(echo "$line" | awk '{print $2}')

    # loop on each file in argument
    for datafile in "${datafiles[@]}"
    do
        foundsomething='false'
        # for each line in the datafile...
        while read -r line2
        do
            # If the line is a range line, process it
            if [ "$(echo "$line2" | grep -c ":")" -gt 0 ]
            then
                # Extract the range
                startrange=$(echo "$line2" | awk '{print $1}' | cut -d':' -f2 | cut -d'-' -f1)
                endrange=$(echo "$line2" | awk '{print $1}' | cut -d':' -f2 | cut -d'-' -f2)
                #echo "range= $startrange --> $endrange"
                # Verify if endposition fits within the range...
                if [ "$endposition" -ge "$startrange" ] && [ "$endposition" -le "$endrange" ]
                then
                    percentage=$(echo "$line2" | awk '{print $3}')
                    outputline="$outputline\t$percentage"
                    foundsomething='true'
                fi
            fi
        done < "$datafile"

        # When done processing the file, we must check if something was found
        if [ "$foundsomething" = 'false' ]
        then
            outputline="$outputline\t-"
        fi
    done

    # When done processing that line from genelist, output it
    # (quoted, so the tabs that came from the genelist are preserved)
    echo -e "$outputline" >>"$outputfile"
done < "$genelistfile"
I have put in a lot of comments to explain what is going on, but I made a few assumptions here to simplify the code.
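One assumption that can be read off the code itself is that every range line in a data file looks like ID:start-end, followed by the scaffold name and the percentage. As a minimal sketch of that parsing step on its own, here it is done with bash parameter expansion rather than the awk and cut pipelines. The function name parse_range_line is made up for illustration and is not part of the script above.

```shell
#!/bin/bash
# Hypothetical helper (not part of the script above): split one data line such as
#   NA1:1838-2030 scaffold40|size16362 100.000
# into its ID, range start, range end and identity percentage.
parse_range_line()
{
    local line=$1
    local region scaffold percentage
    # read splits the line on whitespace (spaces or tabs)
    read -r region scaffold percentage <<< "$line"
    local id=${region%%:*}          # text before the first ':'
    local range=${region#*:}        # text after the first ':'
    local startrange=${range%%-*}   # text before the '-'
    local endrange=${range#*-}      # text after the '-'
    printf '%s %s %s %s\n' "$id" "$startrange" "$endrange" "$percentage"
}

parse_range_line "NA1:1838-2030 scaffold40|size16362 100.000"
```

This avoids starting several extra processes per line, which can matter when the data files are large. It still assumes the ID contains no ':' and the positions are plain integers.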
It works for me with your sample data. Have fun!
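If the loops above turn out to be slow on bigger inputs, the same table can probably be built in a single awk pass, reading each file only once. The sketch below is one way to do it under stated assumptions: the gene list is the first argument, each data file starts with a name line such as NA1.fa, and a gene matches a data line only when the end positions are exactly equal (this follows the wording of the question and is stricter than the range test in the script above). The function name compare_table is hypothetical.

```shell
#!/bin/bash
# Sketch: build the comparison table with one awk invocation.
# Assumption: a gene matches a data line when their end positions are equal.
compare_table()
{
    awk '
        # Skip blank lines in any file.
        NF == 0 { next }
        # First file: the gene list. Keep each line and its end position.
        FNR == NR {
            if (FNR == 1) { header = $0; next }
            ngenes++
            gene[ngenes] = $0
            geneend[ngenes] = $2
            next
        }
        # Data files: a line without ":" is the name line, e.g. "NA1.fa".
        $0 !~ /:/ {
            nfiles++
            sub(/\..*$/, "", $1)          # NA1.fa -> NA1
            header = header "\t" $1
            next
        }
        # Range line, e.g. "NA1:1838-2030 scaffold40|size16362 100.000".
        {
            split($1, part, /[-:]/)       # part[1]=ID, part[2]=start, part[3]=end
            pct[nfiles, part[3]] = $3
        }
        END {
            print header
            for (i = 1; i <= ngenes; i++) {
                line = gene[i]
                for (f = 1; f <= nfiles; f++) {
                    key = f SUBSEP geneend[i]
                    line = line "\t" ((key in pct) ? pct[key] : "-")
                }
                print line
            }
        }
    ' "$@"
}
```

It would be called as compare_table genelist.txt NA1.fa NA4.fa > comparison.tsv, where the file names are only examples. Percentages are printed exactly as they appear in the data files (100.000 rather than 100), and a gene with no match in a file gets '-'.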