Question

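I am struggling to write a script that "loops" over the extraction of substrings from one file, while taking the information on where to cut from another file. I am using bash in MobaXterm. I have the file cut_positions.txt, which is tab-delimited and lists name, start, end, length, and annotation: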

k141_20066  103484  104617  1133    phnW
k141_20841  13200   14324   1124    phnW
k141_23852  69      452     383     phnW
k141_32328  1       180     179     phnW

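and string_file.txt with the name (removing or adding the ">" in one of the files is no problem) and the string (the original strings are much longer, up to 1,000,000 characters):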

>k141_10671 CCTTCCCCCACACGCCGCTCTTCCGCTCTTGCTGGCC  
>k141_10707 AGGCGGTATCAGACCTTGCCGCAACACTAAGCCCAGTAACGCTGTCGCCCTTATATCTGA  
>k141_11190 CTTTTGTGACAGTGCAGGGCAATGGTGGATTTATCAGTATCGGGCAGAA  
>k141_1479  AGCCGACAGCAGCGCCGAGGGCACATAATCCGATGACACGATGTCCAAAAGATCCGCCTCGGC

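Now I would like to use the input from cut_positions.txt. I want to use the first column to match the right line, then the second column as the start of the substring and the fourth column as the length of the substring. This should be done for all lines in cut_positions.txt, and the result written to a new out.txt. To get closer, I tried (with my original data):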

➤ grep ">k141_28027\b" test_out_one_line.txt | awk '{print substr($2,57251,69)}'
TCACTTGAGCGCAATTATTCGCTCTCCGGCGGCGTCAGCATCAGCCTGATCATGCGTCACCAAAAGTGT

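which works fine as a manual approach. I also figured out how to access the individual elements of cut_positions.txt (here the second column of the first row):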

awk -F '\t' 'NR==1{print $2}' cut_positions.txt

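But I can't figure out how to turn this into a loop, because I don't know how to connect the different redirections, pipes, and so on that I used for the small steps. Any help is much appreciated (and tell me if you need more sample data).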

Thanks, crazysantaclaus

Answer 1

The following script should work for you:

cut.awk

# We are reading two files: pos.txt and strings.txt
# NR is equal to FNR as long as we are reading the
# first file.
NR==FNR{
    pos[">"$1]=$2 # Store the startpoint in an array pos (indexed by $1)
    len[">"$1]=$4 # Store the length in an array len (indexed by $1)
    next # skip the block below for pos.txt
}

# This runs on every line of strings.txt
$1 in pos {
    # Extract a substring of $2 based on the position and length
    # stored above
    key=$1
    mod=substr($2,pos[key],len[key])
    $2=mod
    print # Print the modified line
}

Call it like this:

awk -f cut.awk pos.txt strings.txt
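
For the file names used in the question, the same call could look roughly like the sketch below. The miniature sample data and the scratch directory are made up for illustration, and the inline awk program is only a condensed form of cut.awk, so treat it as a starting point rather than a tested pipeline for the real data.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Scratch directory (hypothetical) so no existing files are overwritten.
workdir=$(mktemp -d)

# Made-up miniature cut_positions.txt: name, start, end, length, annotation.
printf 'foo\t2\t5\t3\tphnW\n'   > "$workdir/cut_positions.txt"
printf 'bar\t4\t5\t1\tphnW\n'  >> "$workdir/cut_positions.txt"
printf 'test\t1\t5\t4\tphnW\n' >> "$workdir/cut_positions.txt"

# Made-up miniature string_file.txt: ">name", then the string.
printf '>foo 123456\n>bar 123456\n>non 123456\n' > "$workdir/string_file.txt"

# Condensed form of cut.awk: remember start and length per name while
# reading the first file, then cut the matching lines of the second file.
awk '
NR == FNR { pos[">" $1] = $2; len[">" $1] = $4; next }
$1 in pos { $2 = substr($2, pos[$1], len[$1]); print }
' "$workdir/cut_positions.txt" "$workdir/string_file.txt" > "$workdir/out.txt"

cat "$workdir/out.txt"
```

Only names that occur in both files produce a line in out.txt; "test" and ">non" have no partner and are skipped.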

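One important thing to mention: substr() assumes that strings start at index 1, in contrast to most programming languages, where strings start at index 0. If the positions in pos.txt are 0-based, the substr() call must become:

mod=substr($2,pos[key]+1,len[key])

I recommend testing it with simplified, meaningful versions of the input files:

pos.txt

foo  2  5  3    phnW
bar  4  5  1    phnW
test 1  5  4    phnW

and strings.txt

>foo 123456
>bar 123456
>non 123456

Output:

>foo 234
>bar 4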
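
The effect of the index base on substr() can be checked in isolation with a small sketch like the one below. The variable names start, len, one_based, and zero_based are chosen here for illustration only; the same start value of 2 is read once as a 1-based and once as a 0-based position.

```shell
#!/usr/bin/env bash

# Start position taken as 1-based: passed to substr() unchanged.
one_based=$(echo "123456" | awk -v start=2 -v len=3 '{ print substr($0, start, len) }')

# Same start position taken as 0-based: shifted by one before the call.
zero_based=$(echo "123456" | awk -v start=2 -v len=3 '{ print substr($0, start + 1, len) }')

echo "$one_based"    # 234
echo "$zero_based"   # 345
```

Comparing one known substring from the real data against both variants is a quick way to tell which convention the start column follows.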

How to extract substrings from one file using the substring position info in another file (loop, bash)
