Finding matching strings in two text files with different structures

Time: 2015-03-04 13:20:46

Tags: shell

I have a tab-delimited text file A (representing BLAST output):

Name1   BBBBBBBBBBBB    99.40   166 1   0   1   166 334 499 3e-82    302
Name2   DDDDDDDDDDDD    98.80   167 2   0   1   167 346 512 4e-81    298

and a text file B (representing a phylogenetic dendrogram) that looks like this:

"Cluster A": {
        "member": {
            "Cluster A": "BBBBBBBBBBBB This is Animal A", 
                   }, 
        "name": "Cluster A"
             }, 
    "Cluster B: {
        "member": {
            "Cluster B": "DDDDDDDDDDDD This is Animal B"
                   }, 
        "name": "cluster B"
                 }

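I want to take the string in the second tab-separated column of file A (e.g. DDDDDDDDDDDD) and look it up in file B. The script should then append the information found in file B to file A as new columns: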

Name1   BBBBBBBBBBBB    99.40   166 1   0   1   166 334 499 3e-82    302 Cluster A This is Animal A
Name2   DDDDDDDDDDDD    98.80   167 2   0   1   167 346 512 4e-81    298 Cluster B This is Animal B

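Thanks a lot!

3 answers:

Answer 0 (score: 0)

Some example code that reads the data from both files. Your example is missing the outer {}, which is why the parsing code adds them.

It then loops over the cluster members and builds the desired result.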

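import json
import re

# Read the BLAST output (file A), one hit per line.
with open("in1") as blast:
    blast_data = blast.readlines()

# File B lacks the outer braces, so wrap it before parsing it as JSON.
with open("in2") as jsonfile:
    json_data = json.loads("{%s}" % jsonfile.read())

for bdata in blast_data:
    seq_id = bdata.split()[1]  # the second column holds the sequence ID
    for cluster in json_data:
        for member, description in json_data[cluster]['member'].items():
            if seq_id in description:
                # Drop the ID from the description and append what is left.
                rest = re.sub(re.escape(seq_id), '', description).strip()
                print("%s %s %s" % (bdata.strip(), member, rest))
                break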
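
The nested loops above rescan every cluster for each BLAST line. As a rough sketch of an alternative, the members can be indexed by sequence ID once and then looked up per line. The function names `build_index` and `annotate` are made up for illustration, the inline sample data stands in for the two files, and the code assumes every member value has the form "ID description".

```python
import json

def build_index(clusters):
    """Map each sequence ID to (member key, description)."""
    index = {}
    for cluster in clusters.values():
        for member, value in cluster["member"].items():
            # Assumption: the value looks like "<ID> <free-text description>".
            seq_id, _, description = value.partition(" ")
            index[seq_id] = (member, description)
    return index

def annotate(blast_lines, index):
    """Yield each BLAST line with cluster name and description appended."""
    for line in blast_lines:
        line = line.rstrip("\n")
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        hit = index.get(fields[1])
        if hit:
            yield "\t".join((line,) + hit)

# Inline sample data standing in for file B (already wrapped in braces) and file A.
sample_json = json.loads("""{
  "Cluster A": {"member": {"Cluster A": "BBBBBBBBBBBB This is Animal A"}, "name": "Cluster A"},
  "Cluster B": {"member": {"Cluster B": "DDDDDDDDDDDD This is Animal B"}, "name": "cluster B"}
}""")
sample_blast = ["Name1\tBBBBBBBBBBBB\t99.40\n", "Name2\tDDDDDDDDDDDD\t98.80\n"]

for row in annotate(sample_blast, build_index(sample_json)):
    print(row)
```

Lines whose ID is not found in the index are silently dropped here; whether that is the right behaviour depends on the data.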

Answer 1 (score: 0)

Fix the JSON file:

$ cat B
[
    { "Cluster A": { "member": { "Cluster A": "BBBBBBBBBBBB This is Animal A" }, "name": "Cluster A" } }, 
    { "Cluster B": { "member": { "Cluster B": "DDDDDDDDDDDD This is Animal B" }, "name": "cluster B" } }
]

Then, a Perl solution:

perl -MJSON -MPath::Class -E '
    my $data = decode_json file("B")->slurp;
    $, = "\t";
    for my $line (file("A")->slurp(chomp => 1)) {
        my @F = split /\t/, $line;
        for my $item (@$data) {
            for my $cluster (keys %$item) {
                while (my ($key, $value) = each %{$item->{$cluster}{member}} ) {
                    if ($value =~ /\Q$F[1]\E\s+(.*)/) {
                        say $line, $cluster, $1;
                    }
                }
            }
        }
    }
'

Output:

Name1   BBBBBBBBBBBB    99.40   166 1   0   1   166 334 499 3e-82   302 Cluster A   This is Animal A
Name2   DDDDDDDDDDDD    98.80   167 2   0   1   167 346 512 4e-81   298 Cluster B   This is Animal B

For kicks, the equivalent Ruby:

ruby -rjson -e '
  data = JSON.load File.new("B")
  File.readlines("A").each {|line|
    line.chomp!
    f = line.split("\t")
    data.each {|obj|
      obj.each_key {|cluster|
        obj[cluster]["member"].each_pair {|key, value| 
          if m = value.match(/#{Regexp.escape(f[1])}\s+(.*)/)
            puts [line, cluster, m[1]].join("\t")
          end
        }
      }
    }
  }
'
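
In the same spirit, here is a hedged Python sketch of that loop for the repaired file B (a list of single-key objects). The function name `annotate` and the inline sample data are illustrative only; `re.escape` is used so that IDs containing characters such as `|` or `.` are matched literally.

```python
import json
import re

def annotate(blast_lines, clusters):
    """Append cluster name and description to each matching BLAST line."""
    out = []
    for line in blast_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Match "<ID><whitespace><description>" with the ID taken literally.
        pattern = re.compile(re.escape(fields[1]) + r"\s+(.*)")
        for item in clusters:  # one single-key object per cluster
            for cluster, body in item.items():
                for value in body["member"].values():
                    m = pattern.search(value)
                    if m:
                        out.append("\t".join([line, cluster, m.group(1)]))
    return out

# Inline sample data standing in for the repaired file B and for file A.
repaired_b = json.loads("""[
  {"Cluster A": {"member": {"Cluster A": "BBBBBBBBBBBB This is Animal A"}, "name": "Cluster A"}},
  {"Cluster B": {"member": {"Cluster B": "DDDDDDDDDDDD This is Animal B"}, "name": "cluster B"}}
]""")
file_a = ["Name1\tBBBBBBBBBBBB\t99.40\t166\n", "Name2\tDDDDDDDDDDDD\t98.80\t167\n"]

print("\n".join(annotate(file_a, repaired_b)))
```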

Answer 2 (score: 0)

Shell script code:

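#!/usr/bin/ksh
# Collect the IDs from the second column of file1.
awk '{print $2}' file1 > tmpfile
for i in `cat tmpfile`
do
    # Find the line in file2 that contains the ID as a whole word.
    aa=`grep -w "$i" file2`
    # Append that line to the matching row of file1.
    awk -v out="$aa" -v pattern="$i" '$2 ~ pattern { print $0 "   " out }' file1
done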

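awk '{print $2}' file1 > tmpfile - takes the patterns from the first file and stores them in tmpfile
aa=`grep -w $i file2` - matches the same pattern in file2 and stores the whole matching line in the variable aa
awk -v out="$aa" -v pattern="$i" '$2 ~ pattern { print $0" "out}' file1 - appends the string from file2 to its corresponding line in file1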
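
Note that the grep approach appends the whole matching line of file2, quotes and all. As a sketch of how the cluster name and description could be pulled out of the unrepaired file B instead, the following uses a regular expression rather than a JSON parser. The names `MEMBER_RE`, `parse_members` and `annotate` are hypothetical, and the code assumes each member entry sits on one line as "key": "ID description" and that "name" entries are not members.

```python
import re

# One member entry per line, e.g.:  "Cluster A": "BBBBBBBBBBBB This is Animal A",
MEMBER_RE = re.compile(r'"([^"\n]+)"\s*:\s*"([^"\s]+)[ \t]+([^"\n]*)"')

def parse_members(text):
    """Return {sequence ID: (cluster, description)} from the raw text of file B."""
    index = {}
    for m in MEMBER_RE.finditer(text):
        key, seq_id, description = m.groups()
        if key == "name":  # assumption: "name" entries are not members
            continue
        index[seq_id] = (key, description)
    return index

def annotate(blast_text, index):
    """Return the BLAST lines whose ID is known, with two columns appended."""
    rows = []
    for line in blast_text.splitlines():
        fields = line.split()
        if len(fields) > 1 and fields[1] in index:
            rows.append("\t".join((line,) + index[fields[1]]))
    return rows

# File B exactly as posted, including the missing quote and trailing comma.
file_b = '''"Cluster A": {
        "member": {
            "Cluster A": "BBBBBBBBBBBB This is Animal A",
                   },
        "name": "Cluster A"
             },
    "Cluster B: {
        "member": {
            "Cluster B": "DDDDDDDDDDDD This is Animal B"
                   },
        "name": "cluster B"
                 }'''
file_a = "Name1\tBBBBBBBBBBBB\t99.40\nName2\tDDDDDDDDDDDD\t98.80\n"

for row in annotate(file_a, parse_members(file_b)):
    print(row)
```

A regular expression is more forgiving than a JSON parser here, but it will also accept input that is not really in the expected format, so it is worth checking the result on a few known IDs.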