Question

I have an input.txt file whose lines represent commands, each taking two input arguments:

commands a b 
commands a c
commands b c 
...

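I want to delete every line whose corresponding output file already exists in the folder out. For example, suppose only the files out/a_b_out and out/b_c_out exist. Then I want to remove the first and third lines from input.txt.

Also, out may contain millions of files, so I need an efficient way to find the matches. On the other hand, input.txt has on the order of a few thousand lines, which is much more manageable.

I tried first extracting the patterns from the input file (e.g. cut -d " " -f 2-3 input.txt | sed -e 's/\ /_/g'), then looping over those entries and using grep and so on.

I would like to know whether there is a faster and more elegant way to do this. Thanks!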

Answer 1

This may work for your case:

while read -r c x y; 
do [ -f "out/${x}_${y}_out" ] || echo "$c" "$x" "$y" 
done < input.txt

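This iterates over the shorter input file and filters its lines against the existing files; the output is the commands for which no file was found. If the input file is not well-formed, you may need to harden the read command.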
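
As an illustration of what "hardening" the loop above could mean, here is a minimal sketch. The function name filter_pending and its two parameters are made up for this example, and the choice to pass through lines with fewer than three fields unchanged is an assumption, not something the question specifies. It keeps each line verbatim (including trailing whitespace) and also handles a last line that lacks a trailing newline.

```shell
#!/bin/sh
# filter_pending INPUT OUTDIR   (hypothetical helper, not from the original answer)
# Prints every line of INPUT for which OUTDIR/<arg1>_<arg2>_out does not exist.
filter_pending() {
    input=$1
    outdir=$2
    # IFS= and -r keep the line verbatim; the || clause still processes
    # a final line that has no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        # Word-split a copy of the line without filename globbing.
        set -f; set -- $line; set +f
        # Assumption: lines with fewer than three fields are kept unchanged.
        if [ "$#" -lt 3 ]; then
            printf '%s\n' "$line"
            continue
        fi
        # $2 and $3 are the two command arguments.
        [ -f "$outdir/${2}_${3}_out" ] || printf '%s\n' "$line"
    done < "$input"
}
```

It could be called as filter_pending input.txt out > input.new, followed by mv input.new input.txt once the result looks right.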

Answer 2

Unless you need awk for other processing, or you need to preserve the whitespace of the input lines, consider karakfa's helpful shell-only solution.

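An awk solution:

Given that out/ may contain millions of files, building an index of the file names is not an option, but you can shell out to test for each file's existence.

This will be slow, because an sh child process is created for every input line, but with only a few thousand input lines that is acceptable: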

awk '{ fpath = "out/" $2 "_" $3 "_out"; if (1 == system("[ -f '\''" fpath "'\'' ]")) print }' \
  input.txt > input.tmp.$$.txt && mv input.tmp.$$.txt input.txt
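
If the per-line sh process is a concern, one possible alternative (a different technique from the one in this answer) is to let awk try to open the file itself: getline returns -1 when a file cannot be opened and 0 or 1 otherwise. The sketch below relies on that; the function name pending_awk is hypothetical, and note the assumption that a file which exists but is unreadable would be treated as missing.

```shell
#!/bin/sh
# pending_awk INPUT OUTDIR   (hypothetical helper, not from the original answer)
# Prints the lines of INPUT whose output file cannot be opened in OUTDIR.
pending_awk() {
    awk -v dir="$2" '
    {
        fpath = dir "/" $2 "_" $3 "_out"
        # getline yields -1 if fpath cannot be opened, 0 at EOF, 1 on a read.
        exists = ((getline junk < fpath) >= 0)
        # Close the file so that open descriptors do not accumulate.
        close(fpath)
        if (!exists) print
    }' "$1"
}
```

As with the command above, the output would be redirected to a temporary file and then moved over input.txt.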

Answer 3

With awk (if awk is an option), see this small test, which does the opposite and prints the matching lines (for testing only):

$ cat file3
commands a b 
commands a c
commands b c

$ ls -l *_out
-rw-r--r-- 1 root root 0 Mar 15 04:02 a_b_out
-rw-r--r-- 1 root root 0 Mar 15 04:05 b_c_out

$ awk 'NR==FNR{a[$2 "_" $3 "_out"]=$0;next}($0 in a){print a[$0]}' file3 <(find . -maxdepth 1 -type f -printf %f\\n)
commands b c
commands a b

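This means that the inverse filter, which keeps only the lines without a matching file, should give you the desired result: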

$ awk 'NR==FNR{k[FNR]=$2 "_" $3 "_out";l[FNR]=$0;want[k[FNR]];n=FNR;next}($0 in want){hit[$0]}END{for(i=1;i<=n;i++)if(!(k[i] in hit))print l[i]}' inputfile <(find . -maxdepth 1 -type f -printf %f\\n) >newfile

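You can remove -maxdepth 1 to descend into all subdirectories.

This solution builds its index from the small input file rather than from the potentially millions of files, so performance is expected to be good enough.

Sending the non-matching lines to a new file will be much faster than repeatedly overwriting the existing file.

When it is done, you can move newfile over the old file (mv newfile inputfile).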
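
The write-to-a-new-file-then-move step can be wrapped up roughly as follows. This is only a sketch: the function name prune_input is hypothetical, it assumes GNU find (for -printf) and a mktemp that accepts a template, and the -name '*_out' filter is an addition that merely narrows the stream of file names. The input file is replaced only if the filter succeeded.

```shell
#!/bin/sh
# prune_input INPUT OUTDIR   (hypothetical helper, not from the original answer)
# Rewrites INPUT so that it keeps only lines without a matching OUTDIR/<a>_<b>_out.
prune_input() {
    input=$1
    outdir=$2
    # Nothing to do for an empty input file (also keeps NR==FNR meaningful).
    [ -s "$input" ] || return 0
    tmpfile=$(mktemp "${input}.XXXXXX") || return 1
    # awk reads the small command file first (building the index), then the
    # file names streamed by find on standard input ("-").
    if find "$outdir" -maxdepth 1 -type f -name '*_out' -printf '%f\n' |
        awk 'NR==FNR { key[FNR] = $2 "_" $3 "_out"; line[FNR] = $0; want[key[FNR]]; n = FNR; next }
             ($0 in want) { hit[$0] }
             END { for (i = 1; i <= n; i++) if (!(key[i] in hit)) print line[i] }' "$input" - > "$tmpfile"
    then
        mv "$tmpfile" "$input"
    else
        rm -f "$tmpfile"
        return 1
    fi
}
```

A call such as prune_input input.txt out would then leave only the pending commands in input.txt. Since mktemp creates the temporary file with restrictive permissions, the replaced file may end up with different permissions than the original.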

Delete lines from a file that contain a string matching a file name in a folder
