Question

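I have a file named words.txt that contains a list of words. I also have a file named file.txt that contains one sentence per line. I need to quickly delete every line in file.txt that contains one of the lines from words.txt, but only when the match occurs between { and }.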

For example, file.txt:

Once upon a time there was a cat.
{The cat} lived in the forest.
The {cat really liked to} eat mice.

For example, words.txt:

cat
mice

Expected output:

Once upon a time there was a cat.

The other two lines are removed because "cat" is found in both of them, and the word also appears between { and }.

The following script does the job:

while read -r line
do
    sed -i "/{.*$line.*}/d" file.txt
done < words.txt

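This script is very slow. Sometimes words.txt contains several thousand entries, so the while loop takes several minutes. I tried to use the sed -f option, which seems to allow reading commands from a file, but I could not find any manual that explains how to use it.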

How can I make the script faster?

Answer 1

An awk solution:

awk 'NR==FNR{a["{[^{}]*"$0"[^{}]*}"]++;next}{for(i in a)if($0~i)next;b[j++]=$0}END{printf "">FILENAME;for(i=0;i in b;++i)print b[i]>FILENAME}' words.txt file.txt

It rewrites file.txt in place so that it contains the expected output:

Once upon a time there was a cat.

Expanded version:

awk '
    NR == FNR {
        a["{[^{}]*" $0 "[^{}]*}"]++
        next
    }
    {
        for (i in a)
            if ($0 ~ i)
                next
        b[j++] = $0
    }
    END {
        printf "" > FILENAME
        for (i = 0; i in b; ++i)
            print b[i] > FILENAME
    }
' words.txt file.txt

If the file is expected to be too large for awk to hold in memory, we cannot modify it in place this way and can only write the result to stdout:

awk '
    NR == FNR {
        a["{[^{}]*" $0 "[^{}]*}"]++
        next
    }
    {
        for (i in a)
            if ($0 ~ i)
                next
    }
    1
' words.txt file.txt
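If the result still has to end up in file.txt, one common approach is to write the stdout version to a temporary file and then move it over the original. The following is a minimal sketch under that assumption; the temporary directory and the name file.txt.new are illustrative, and the braces are written as [{] and [}] only to keep them unambiguous as literals in awk regular expressions.

```shell
# Throwaway demo data in a temp directory (paths are illustrative).
tmp=$(mktemp -d)
printf '%s\n' 'Once upon a time there was a cat.' \
              '{The cat} lived in the forest.' \
              'The {cat really liked to} eat mice.' > "$tmp/file.txt"
printf '%s\n' cat mice > "$tmp/words.txt"

# Stream the filtered lines to a new file, then replace the original
# only if awk succeeded.
awk '
    NR == FNR { a["[{][^{}]*" $0 "[^{}]*[}]"]; next }   # one pattern per word
    { for (i in a) if ($0 ~ i) next }                   # skip lines that match any pattern
    1                                                   # print everything else
' "$tmp/words.txt" "$tmp/file.txt" > "$tmp/file.txt.new" &&
    mv "$tmp/file.txt.new" "$tmp/file.txt"

result=$(cat "$tmp/file.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Because awk never holds more than the word patterns in memory here, the size of file.txt does not matter.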

Answer 2

You can use grep to match the two files against each other, like this:

grep -vf words.txt file.txt

Answer 3

I think using grep should be faster. For example:

grep -f words.txt -v file.txt

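The -f option makes grep read its patterns from words.txt.
The -v option inverts the match, i.e. it keeps only the lines that match none of the patterns.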

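This does not handle the {} constraint, but that is easy to work around, e.g. by adding the braces to the patterns in the pattern file (or in a temporary file created at run time).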
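To make the difference concrete, here is a small sketch that runs both variants on the sample data from the question. The temporary directory and the file name patterns.txt are assumptions made for the demo, and the words are assumed to contain no regular-expression metacharacters.

```shell
# Throwaway demo data in a temp directory (paths are illustrative).
tmp=$(mktemp -d)
printf '%s\n' 'Once upon a time there was a cat.' \
              '{The cat} lived in the forest.' \
              'The {cat really liked to} eat mice.' > "$tmp/file.txt"
printf '%s\n' cat mice > "$tmp/words.txt"

# Plain patterns: every sample line mentions "cat", so no line survives.
# grep exits non-zero when it prints nothing, hence the "|| true".
plain=$(grep -v -f "$tmp/words.txt" "$tmp/file.txt" || true)

# Wrapped patterns ({.*cat.*}, {.*mice.*}): only matches inside braces count.
sed 's/.*/{.*&.*}/' "$tmp/words.txt" > "$tmp/patterns.txt"
wrapped=$(grep -v -f "$tmp/patterns.txt" "$tmp/file.txt" || true)

printf 'plain:   [%s]\n' "$plain"
printf 'wrapped: [%s]\n' "$wrapped"
rm -rf "$tmp"
```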

Answer 4

I think this should work for you:

sed -e 's/.*/{.*&.*}/' words.txt | grep -vf- file.txt > out ; mv out file.txt

This essentially rewrites the contents of words.txt on the fly and feeds the result to grep as its pattern file.
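The same generate-the-patterns idea can also be applied to sed -f, which the question asked about: sed -f reads its commands from a script file, one per line. A minimal sketch, assuming the words contain neither / nor regular-expression metacharacters; the file name delete.sed and the temporary directory are made up for the demo.

```shell
# Throwaway demo data in a temp directory (paths are illustrative).
tmp=$(mktemp -d)
printf '%s\n' 'Once upon a time there was a cat.' \
              '{The cat} lived in the forest.' \
              'The {cat really liked to} eat mice.' > "$tmp/file.txt"
printf '%s\n' cat mice > "$tmp/words.txt"

# Turn each word into one sed delete command, e.g.  /{.*cat.*}/d
sed 's|.*|/{.*&.*}/d|' "$tmp/words.txt" > "$tmp/delete.sed"

# Apply every command in a single pass over file.txt.
result=$(sed -f "$tmp/delete.sed" "$tmp/file.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Compared with the loop in the question, file.txt is read once and never rewritten per word; adding -i to the second sed call would edit the file in place.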

Answer 5

You can do this in two steps:

Wrap each word in words.txt with {.* and .*}:

awk '{ print "{.*" $0 ".*}" }' words.txt > wrapped.txt

Use grep with an inverted match:
```
grep -v -f wrapped.txt file.txt
```

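This is particularly useful if words.txt is very large, because a pure awk approach (which stores every entry of words.txt in an array) would need a lot of memory.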

If you prefer a one-liner and want to skip creating the intermediate file, you can do this:

awk '{ print "{.*" $0 ".*}" }' words.txt | grep -v -f - file.txt

The - is a placeholder that tells grep to read the patterns from stdin.

Update

If words.txt is not too large, you can do the whole thing in awk:

awk 'NR==FNR{a[$0]++;next}{p=1;for(i in a){if ($0 ~ "{.*" i ".*}") { p=0; break}}}p' words.txt file.txt

Expanded:

awk 'NR==FNR { a[$0]++; next }
     { 
         p=1
         for (i in a) {
             if ($0 ~ "{.*" i ".*}") { p=0; break }
         }
     }p' words.txt file.txt

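The first block builds an array containing every line of words.txt. The second block runs for each line of file.txt. The flag p controls whether the line is printed: if the line matches one of the patterns, p is set to false. When the p after the last block evaluates to true, the default action takes place, which is to print the line.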

Answer 6

Pure native bash (4.x):

#!/usr/bin/env bash
# ^-- MUST run under bash 4.x or later (readarray needs it), NOT /bin/sh

readarray -t words <words.txt          # read words into array
IFS='|'                                # use | as delimiter when expanding $*
words_re="[{].*(${words[*]}).*[}]"     # form a regex matching all words
while read -r; do                      # for each line in file...
  if ! [[ $REPLY =~ $words_re ]]; then # ...check whether it matches...
    printf '%s\n' "$REPLY"             # ...and print it if not.
  fi
done <file.txt

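Native bash is somewhat slower than awk, but this is still a single-pass solution (O(n+m), whereas the sed -i approach is O(n*m)), which makes it vastly faster than any iterative approach.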
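The same single-regex idea can be sketched in awk as well, for cases where bash 4 is not available. This is only an illustration under a few assumptions: the words contain no regular-expression metacharacters, words.txt is non-empty and has no blank lines (an empty alternative would match every braced line), and the variable name re is made up for the example.

```shell
# Throwaway demo data in a temp directory (paths are illustrative).
tmp=$(mktemp -d)
printf '%s\n' 'Once upon a time there was a cat.' \
              '{The cat} lived in the forest.' \
              'The {cat really liked to} eat mice.' > "$tmp/file.txt"
printf '%s\n' cat mice > "$tmp/words.txt"

result=$(awk '
    NR == FNR { re = (re == "" ? "" : re "|") $0; next }   # join the words: cat|mice
    FNR == 1  { re = "[{].*(" re ").*[}]" }                 # wrap once, at the first line of file.txt
    $0 !~ re                                                # print lines that do not match
' "$tmp/words.txt" "$tmp/file.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Each line of file.txt is tested against one combined pattern instead of being tested once per word.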

How to quickly delete lines in a file that contain items from a list in another file in BASH?
