Question

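I have a large file and want to delete from it every line that contains an exact string listed in another file. The strings have to match exactly, though (sorry, I don't know how to describe it better).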

Here is the file:

one@email.com,name,surname,city,state
two@email.com,name,surname,city,state
three@email.com,name,surname,city,state
anotherone@email.com,name,surname,city,state

Here is a sample of the filter list:

one@email.com
three@email.com

The desired output is:

two@email.com,name,surname,city,state
anotherone@email.com,name,surname,city,state

I tried doing this with:

grep -v -f 2.txt 1.txt > 3.txt

But this produces the output:

two@email.com,name,surname,city,state

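I think this happens because "anotherone@email.com" contains "one@email.com". I'm looking for a way to anchor the match to the start of the line, but haven't found a suitable one.

I'm also open to using something other than grep; I only used grep because I couldn't work out any other way to solve it.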

Answer 1

Assuming your input file contains three@gmail.com rather than three@email.com (probably a typo):

$ grep -vw -f 2.txt 1.txt
two@email.com,name,surname,city,state
anotherone@email.com,name,surname,city,state

-w, --word-regexp - The expression is searched for as a word (as if surrounded by [[:<:]] and [[:>:]]).
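
As a hedged variation on the command above: the entries in 2.txt are plain strings, not regular expressions, so adding -F keeps the "." in each address from matching any character. The sketch below recreates the sample files from the question in a scratch directory (the directory and the `result` variable are only for illustration). Note that -w is still only a word-boundary check, so an address such as x-one@email.com would also be filtered out by one@email.com.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Recreate the sample data from the question.
printf '%s\n' \
  'one@email.com,name,surname,city,state' \
  'two@email.com,name,surname,city,state' \
  'three@email.com,name,surname,city,state' \
  'anotherone@email.com,name,surname,city,state' > 1.txt
printf '%s\n' 'one@email.com' 'three@email.com' > 2.txt

# -v: print non-matching lines
# -w: require the match to sit on word boundaries
# -F: treat every line of 2.txt as a fixed string, not a regex
# -f: read the strings from 2.txt
result=$(grep -vwF -f 2.txt 1.txt)
printf '%s\n' "$result"
```

With the sample data this prints the two@email.com and anotherone@email.com lines.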

Answer 2

If you only want to print the lines of the first file whose first field does not match any entry in the second file, this should do it:

$ cat file
one@email.com,name,surname,city,state
two@email.com,name,surname,city,state
three@email.com,name,surname,city,state
anotherone@email.com,name,surname,city,state
$ cat filter
one@email.com
three@email.com

$ awk -F, 'NR==FNR {a[$0]++;next} !($1 in a)' filter file
two@email.com,name,surname,city,state
anotherone@email.com,name,surname,city,state

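For each line in filter, this creates an entry in the array a, keyed by that line and with the value 1,
such as a[one@email.com]=1 and a[three@email.com]=1.
awk then tests each line of file against the array, which gives: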

a[one@email.com]=1
a[two@email.com]=
a[three@email.com]=1
a[anotherone@email.com]=

It then prints every line of file for which the array holds no 1:

two@email.com,name,surname,city,state
anotherone@email.com,name,surname,city,state
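
The one-liner can also be spelled out with comments, which may make the NR==FNR idiom easier to follow. This is only a sketch of the same logic, run against the sample data in a scratch directory; the array name `seen` and the variable `result` are illustrative choices, not part of the original answer.

```shell
# Work in a scratch directory and recreate the sample files.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' \
  'one@email.com,name,surname,city,state' \
  'two@email.com,name,surname,city,state' \
  'three@email.com,name,surname,city,state' \
  'anotherone@email.com,name,surname,city,state' > file
printf '%s\n' 'one@email.com' 'three@email.com' > filter

result=$(awk -F, '
    # NR counts lines across all inputs; FNR restarts at 1 for each file.
    # They are equal only while the first file (filter) is being read.
    NR == FNR { seen[$0] = 1; next }

    # Remaining lines come from the second file: print a line only when
    # its first comma-separated field is not a key in seen.
    !($1 in seen) { print }
' filter file)
printf '%s\n' "$result"
```

One caveat with this idiom: if filter is empty, NR==FNR also holds while file is being read, so nothing is printed.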

Answer 3

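For this particular case, process the first file by building an associative array indexed by the filter lines. For the subsequent file, test whether the first field of each line is absent from the array indices; the default action for a pattern is to print.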

awk -F, -v OFS=, '
    BEGIN   { split("", m) }
    NR==FNR { m[$0] = ""; next }
    !($1 in m)
' filter.txt file.txt

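But... if we want to filter out any string that appears anywhere in the line (an unanchored exact match), we need to do something less clever and more brute-force: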

awk '
    BEGIN {
        split("", m)
        n=0
    }
    NR==FNR {
        m[n++] = $0
        next
    }
    {
        for (i=0; i<n; ++i) {
            if (index($0, m[i]))
                next
        }
        print
    }
' filter.txt file.txt

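Note that if the filter contains non-printable characters (for example, non-Unix line endings), we need to deal with them by stripping them out (for example, with sub(/\r/, "")).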
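
A minimal sketch of that clean-up, assuming a hypothetical filter.txt that was saved with CRLF line endings: the trailing carriage return is stripped from each filter line before it is used as an array index. Without the sub() call, the keys would end in "\r" and no line of file.txt would be filtered.

```shell
# Work in a scratch directory and recreate the sample data.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' \
  'one@email.com,name,surname,city,state' \
  'two@email.com,name,surname,city,state' \
  'three@email.com,name,surname,city,state' \
  'anotherone@email.com,name,surname,city,state' > file.txt

# Assumption for illustration: the filter list has Windows (CRLF) line endings.
printf '%s\r\n' 'one@email.com' 'three@email.com' > filter.txt

result=$(awk -F, '
    NR == FNR {
        sub(/\r$/, "")   # drop a trailing carriage return, if present
        m[$0] = ""       # remember the cleaned filter line as an index
        next
    }
    !($1 in m)           # print lines whose first field is not filtered
' filter.txt file.txt)
printf '%s\n' "$result"
```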
