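How to do a multi-file search with awk across two directories, printing records only from the second directory

Asked: 2016-05-26 22:45:09

Tags: unix awk

I am re-asking a previous question to make it clearer. I am trying to search the files in two directories and print the matching strings (plus the line immediately following each one) from the second directory to a new file, but only when they match a record in the first directory. I have found similar examples, but none that are quite the same. I don't know how to use awk on multiple files from different directories, and I have been torturing myself trying to figure it out.

Directory 1, 28,000 files, in the format: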

>ABC
KLSDFIOUWERMSDFLKSJDFKLSJDSFKGHGJSNDKMVMFHKSDJFS
>GHI
OOILKJSDFKJSDFLMOPIWERIOUEWIRWIOEHKJTSDGHLKSJDHGUIYIUSDVNSDG

Directory 2, 15 files, in the format:

>ABC
12341234123412341234123412341234123412341234123412341234123412341234
>DEF
12341234123412341234123412341234
>GHI
12341234123412341234123412341234123412341234123412341234123412341234123412341234

Desired output:

>ABC
12341234123412341234123412341234123412341234123412341234123412341234
>GHI
12341234123412341234123412341234123412341234123412341234123412341234123412341234

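Directories 1 and 2 are in my home directory (./Test1 and ./Test2).

If anyone can suggest a command that is specific to the different directories, I would be very grateful! Currently, when I include a file path (e.g. /Test1/*.fa), I get the following error: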

awk: can't open file /Test1/*.fa

2 answers:

Answer 0: (score: 0)

You'd want something like this (untested):

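awk '
# At the start of each file, work out which directory it came from.
FNR==1 {
    dirname = FILENAME
    sub("/.*","",dirname)
    if (NR==1) {
        dirname1 = dirname      # directory of the first file, i.e. Test1
    }
}
# Files in the first directory: remember each header line (odd lines).
dirname == dirname1 {
    if (FNR % 2) {
        map[$0]
    }
    next
}
# Files in the second directory: on a header line, decide whether the
# record is wanted, skipping headers that have already been printed.
FNR % 2 {
    found = ($0 in map) && !seen[$0]++
}
# Print the header and the sequence line of each wanted record.
found
' Test1/* Test2/*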

Given that you are getting the error message /usr/bin/awk: Argument list too long, which means you're exceeding the maximum argument length for a command, and that your 28,000 files are in the Test1 directory, try this instead:

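find Test1 -type f -exec cat {} + |
awk '
# First input (stdin, the concatenated Test1 files): remember each header line.
NR == FNR {
    if (FNR % 2) {
        map[$0]
    }
    next
}
# Test2 files: on a header line, decide whether the record is wanted,
# skipping headers that have already been printed.
FNR % 2 {
    found = ($0 in map) && !seen[$0]++
}
# Print the header and the sequence line of each wanted record.
found
' - Test2/*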
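
To try the pipeline on a small scale, here is a minimal sketch that builds a scratch copy of the question's layout and runs the same kind of command against it. The file names a.fa and b.fa, the shortened sequences, and the variable names tmp and result are made up for illustration; only the Test1/Test2 directory names come from the question. It assumes every record is exactly one header line followed by one sequence line. Note that the paths are relative (Test1, Test2/*) rather than /Test1/*.fa, which would be looked up at the root of the filesystem.

```shell
#!/bin/sh
# Scratch directory with hypothetical sample files (names and data invented).
tmp=$(mktemp -d)
mkdir "$tmp/Test1" "$tmp/Test2"
printf '>ABC\nKLSDF\n>GHI\nOOILK\n' > "$tmp/Test1/a.fa"
printf '>ABC\n1234\n>DEF\n12341234\n>GHI\n123412341234\n' > "$tmp/Test2/b.fa"

result=$(
    cd "$tmp" &&
    # Stream every Test1 file into awk on stdin, so the 28,000 file names
    # never appear on the awk command line.
    find Test1 -type f -exec cat {} + |
    awk '
    NR == FNR { if (FNR % 2) want[$0]; next }   # stdin: collect Test1 headers
    FNR % 2   { found = ($0 in want) }          # Test2 header: wanted or not
    found                                       # print header and sequence
    ' - Test2/*
)
printf '%s\n' "$result"
rm -rf "$tmp"
```

Redirecting the final output (for example with > matches.fa, a hypothetical file name) would give the new file the question asks for.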

Answer 1: (score: 0)

A solution in TXR:

Data:

$ ls dir*
dir1:
file1  file2

dir2:
file1  file2

$ cat dir1/file1
>ABC
KLSDFIOUWERMSDFLKSJDFKLSJDSFKGHGJSNDKMVMFHKSDJFS
>GHI
OOILKJSDFKJSDFLMOPIWERIOUEWIRWIOEHKJTSDGHLKSJDHGUIYIUSDVNSDG

$ cat dir1/file2
>XYZ
SDOIWEUROIUOIWUEROIWUEROIWUEROIWUEROUIEIDIDIIDFIFI
>MNO
OOIWEPOIUWERHJSDHSDFJSHDF

$ cat dir2/file1
>ABC
12341234123412341234123412341234123412341234123412341234123412341234
>DEF
12341234123412341234123412341234
>GHI
12341234123412341234123412341234123412341234123412341234123412341234123412341234

$ cat dir2/file2
>STP
12341234123412341234123412341234123412341234123412341234123412341234123412341234
>MNO
123412341234123412341234123412341234123412341234123412341234123412341234
$

Run:

$ txr filter.txr dir1/* dir2/*
>ABC
12341234123412341234123412341234123412341234123412341234123412341234
>GHI
12341234123412341234123412341234123412341234123412341234123412341234123412341234
>MNO
123412341234123412341234123412341234123412341234123412341234123412341234

Code in filter.txr:

@(bind want @(hash :equal-based))
@(next :args)
@(all)
@dir/@(skip)
@(and)
@  (repeat :gap 0)
@dir/@file
@    (next `@dir/@file`)
@    (repeat)
>@key
@      (do (set [want key] t))
@    (end)
@  (end)
@(end)
@(repeat)
@path
@  (next path)
@  (repeat)
>@key
@datum
@    (require [want key])
@    (output)
>@key
@datum
@    (end)
@  (end)
@(end)

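To separate the dir1 paths from the rest, we use an @(all) match (which tries multiple pattern branches, all of which must match) with two branches. The first branch matches a single @dir/@(skip) pattern, binding the variable dir to the text before the slash and ignoring the rest. The second branch matches an entire consecutive sequence of @dir/@file patterns via @(repeat :gap 0). Because the dir variable that appears here already has a binding from the first branch of the all, the matches are constrained to the same directory name. Inside the repeat, we recurse into each file via next and collect the >-delimited keys into the want hash. After that, we treat the remaining arguments as path names of files to process; they do not all have to be in the same directory. In each one we scan for the pattern >@key followed by a @datum line. The @(require ...) directive fails the match if key is not in the want hash; otherwise we fall through to the @(output).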
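
For readers who do not have TXR installed, the same two-phase idea (collect keys from the files that share the first argument's directory, then filter the remaining files) can be sketched in awk. This is an illustrative approximation rather than a translation of filter.txr: it recognises header lines by their leading > instead of by line position, and it prints every line of a wanted record up to the next header. The scratch data is a shortened, made-up version of the dir1/dir2 files above, and the variable names tmp, out, first, want, and keep are hypothetical.

```shell
#!/bin/sh
# Scratch copy of the dir1/dir2 layout with shortened, invented sequences.
tmp=$(mktemp -d)
mkdir "$tmp/dir1" "$tmp/dir2"
printf '>ABC\nKLSD\n>GHI\nOOIL\n' > "$tmp/dir1/file1"
printf '>XYZ\nSDOI\n>MNO\nOOIW\n' > "$tmp/dir1/file2"
printf '>ABC\n1234\n>DEF\n5678\n>GHI\n9012\n' > "$tmp/dir2/file1"
printf '>STP\n3456\n>MNO\n7890\n' > "$tmp/dir2/file2"

out=$(
    cd "$tmp" &&
    awk '
    FNR == 1 {
        dir = FILENAME
        sub("/.*", "", dir)          # keep only the leading directory name
        if (NR == 1) first = dir     # directory of the very first argument
        keep = 0                     # do not carry state across files
    }
    dir == first {                   # phase 1: collect wanted headers
        if (/^>/) want[$0]
        next
    }
    /^>/ { keep = ($0 in want) }     # phase 2: header decides the record
    keep                             # print lines of wanted records
    ' dir1/* dir2/*
)
printf '%s\n' "$out"
rm -rf "$tmp"
```

With this sample data the sketch prints the >ABC, >GHI, and >MNO records from dir2, matching the keys selected in the TXR run above.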