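I am re-asking my previous question to make it clearer. I am trying to search the files in two directories and print the matching strings (plus the line that immediately follows each one) from the second directory to a new file, as long as they match a record in the first directory. I have found similar examples, but none that is quite the same. I don't know how to use awk on multiple files from different directories, and I have been torturing myself trying to figure it out.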
>ABC
KLSDFIOUWERMSDFLKSJDFKLSJDSFKGHGJSNDKMVMFHKSDJFS
>GHI
OOILKJSDFKJSDFLMOPIWERIOUEWIRWIOEHKJTSDGHLKSJDHGUIYIUSDVNSDG
>ABC
12341234123412341234123412341234123412341234123412341234123412341234
>DEF
12341234123412341234123412341234
>GHI
12341234123412341234123412341234123412341234123412341234123412341234123412341234
>ABC
12341234123412341234123412341234123412341234123412341234123412341234
>GHI
12341234123412341234123412341234123412341234123412341234123412341234123412341234
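Directories 1 and 2 are located in my home directory (./Test1 and ./Test2).
I would be very grateful if someone could suggest a command that handles the different directories! At the moment, when I include a file path (e.g. /Test1/*.fa), I get the following error: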
awk: can't open file /Test1/*.fa
Answer 0 (score: 0)
You will want something like this (untested):
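awk '
# At the start of each file, derive its directory name from FILENAME;
# the directory of the very first file is treated as the reference one.
FNR==1 {
    dirname = FILENAME
    sub("/.*","",dirname)
    if (NR==1) {
        dirname1 = dirname
    }
}
# Files in the first directory: odd lines are keys, even lines are values.
dirname == dirname1 {
    if (FNR % 2) {
        key = $0
    }
    else {
        map[key] = $0
    }
    next
}
# Other files: print each known key once, with the value stored for it.
(FNR % 2) && ($0 in map) && !seen[$0,map[$0]]++ {
    print $0 ORS map[$0]
}
' Test1/* Test2/*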
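The script above prints the line stored from the first directory (`map[$0]`). If the output shown in the question is what is wanted, that is, the header plus the line that follows it in the second directory, a minimal sketch could look like the one below. It assumes strictly two-line records; the temporary directory, the file names `a.fa` and `b.fa`, and the shortened sequences are made up for the demonstration.

```shell
#!/bin/sh
# Throwaway copy of the sample data (shortened); names are assumptions.
work=$(mktemp -d)
mkdir "$work/Test1" "$work/Test2"
printf '%s\n' '>ABC' 'KLSDFIOUWER' '>GHI' 'OOILKJSDF' > "$work/Test1/a.fa"
printf '%s\n' '>ABC' '1234' '>DEF' '12341234' '>GHI' '123412341234' > "$work/Test2/b.fa"

# Relative paths are used on purpose: Test1/*.fa, not /Test1/*.fa.
result=$(cd "$work" && awk '
    # Decide once per file whether it belongs to the first directory.
    FNR == 1 { in_first = (FILENAME ~ /^Test1\//) }
    # First directory: remember every header (odd) line, print nothing.
    in_first { if (FNR % 2) want[$0]; next }
    # Other files: on a header line, check whether it was remembered.
    FNR % 2  { hit = ($0 in want) }
    # Print the header and the line after it while hit is true.
    hit
' Test1/*.fa Test2/*.fa)

printf '%s\n' "$result"
rm -rf "$work"
```

As for the error in the question: `/Test1/*.fa` is an absolute path starting at the filesystem root, not in the home directory. When a glob matches nothing, the shell by default passes the pattern through unchanged, which is why awk reports that it cannot open a file literally named `/Test1/*.fa`.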
Given that you are getting the error message /usr/bin/awk: Argument list too long, which means you're exceeding your shell's maximum argument length for a command, and that your 28,000 files are in the Test1 directory, try the following:
find Test1 -type f -exec cat {} \; |
awk '
# Standard input (the concatenated Test1 files) is read first.
NR == FNR {
    if (FNR % 2) {
        key = $0
    }
    else {
        map[key] = $0
    }
    next
}
(FNR % 2) && ($0 in map) && !seen[$0,map[$0]]++ {
    print $0 ORS map[$0]
}
' - Test2/*
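The pipeline above still expands `Test2/*` on the command line, so the same limit could be hit if the second directory is also very large. One possible way around that, sketched below under stated assumptions, is to let `find ... -exec ... {} +` batch the files for both directories. The key file name `keys.txt`, the temporary directory, and the sample file names are hypothetical; this variant prints each matching header together with the line that follows it in Test2.

```shell
#!/bin/sh
# Demo data; directory and file names are assumptions for illustration.
work=$(mktemp -d)
mkdir "$work/Test1" "$work/Test2"
printf '%s\n' '>ABC' 'KLSD' '>GHI' 'OOIL' > "$work/Test1/a.fa"
printf '%s\n' '>XYZ' 'SDOI' '>MNO' 'OOIW' > "$work/Test1/b.fa"
printf '%s\n' '>ABC' '1234' '>DEF' '5678' > "$work/Test2/c.fa"
printf '%s\n' '>STP' '1111' '>MNO' '2222' > "$work/Test2/d.fa"

# Step 1: collect the header (odd) lines of every Test1 file in one key file.
# "{} +" makes find split the file list into batches that fit the limit.
find "$work/Test1" -type f -exec awk 'FNR % 2' {} + > "$work/keys.txt"

# Step 2: run awk over Test2 in batches; each batch reads keys.txt first.
result=$(find "$work/Test2" -type f -exec awk '
    NR == FNR { want[$0]; next }      # first file of each batch: keys.txt
    FNR % 2   { hit = ($0 in want) }  # header line: was it collected?
    hit                               # print header and the following line
' "$work/keys.txt" {} +)

printf '%s\n' "$result"
rm -rf "$work"
```

The order of the printed records follows the order in which find visits the files, so it is not guaranteed to be sorted. The `NR == FNR` test also assumes `keys.txt` is not empty; with an empty key file the first Test2 file of each batch would be mistaken for the key list.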
Answer 1 (score: 0)
A solution in TXR:
Data:
$ ls dir*
dir1:
file1  file2

dir2:
file1  file2
$ cat dir1/file1
>ABC
KLSDFIOUWERMSDFLKSJDFKLSJDSFKGHGJSNDKMVMFHKSDJFS
>GHI
OOILKJSDFKJSDFLMOPIWERIOUEWIRWIOEHKJTSDGHLKSJDHGUIYIUSDVNSDG
$ cat dir1/file2
>XYZ
SDOIWEUROIUOIWUEROIWUEROIWUEROIWUEROUIEIDIDIIDFIFI
>MNO
OOIWEPOIUWERHJSDHSDFJSHDF
$ cat dir2/file1
>ABC
12341234123412341234123412341234123412341234123412341234123412341234
>DEF
12341234123412341234123412341234
>GHI
12341234123412341234123412341234123412341234123412341234123412341234123412341234
$ cat dir2/file2
>STP
12341234123412341234123412341234123412341234123412341234123412341234123412341234
>MNO
123412341234123412341234123412341234123412341234123412341234123412341234
$
Run:
$ txr filter.txr dir1/* dir2/*
>ABC
12341234123412341234123412341234123412341234123412341234123412341234
>GHI
12341234123412341234123412341234123412341234123412341234123412341234123412341234
>MNO
123412341234123412341234123412341234123412341234123412341234123412341234
Code in filter.txr:
@(bind want @(hash :equal-based))
@(next :args)
@(all)
@dir/@(skip)
@(and)
@ (repeat :gap 0)
@dir/@file
@ (next `@dir/@file`)
@ (repeat)
>@key
@ (do (set [want key] t))
@ (end)
@ (end)
@(end)
@(repeat)
@path
@ (next path)
@ (repeat)
>@key
@datum
@ (require [want key])
@ (output)
>@key
@datum
@ (end)
@ (end)
@(end)
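To separate the dir1 paths from the rest, we use an @(all) match (which tries multiple pattern branches, all of which must match) with two branches. The first branch matches a @dir/@(skip) pattern, which binds the variable dir to the text preceding the slash and ignores the rest. The second branch matches a whole consecutive sequence of @dir/@file patterns via @(repeat :gap 0). Because the same dir variable appears there and already has a binding from the first branch of the all, this constrains the matches to the same directory name. Inside the repeat, we recurse into each file via next and collect the >-delimited keys into the want hash. After that, we treat the remaining arguments as path names of files to process; they do not all have to be in the same directory. In each of them we scan for a >@key pattern followed by a @datum line. The @(require ...) directive fails to match if key is not in the want hash; otherwise we fall through to the @(output).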