Question

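I want to check whether all of my strings exist in a text file. They may be on the same line or on different lines, and partial matches should be OK. Like this: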

...
string1
...
string2
...
string3
...
string1 string2
...
string1 string2 string3
...
string3 string1 string2
...
string2 string3
... and so on

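In the above example, we could be using regexes in place of strings.

For example, the following code checks whether any of my strings exists in the file: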

if grep -Eq "string1|string2|string3" file; then
  # there is at least one match
fi

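How do I check whether all of them exist? Since we are only interested in the presence of all matches, we should stop reading the file as soon as all strings have been matched.

Is it possible to do this without invoking grep multiple times (which won't scale when the input file is large or when we have a large number of strings to match) and without using a tool like awk or python?

Also, is there a solution for strings that can easily be extended to regexes?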

Answer 1

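Awk is the tool that the people who invented grep, shell, etc. created for general text manipulation jobs like this, so I'm not sure why you would want to avoid it.

In case brevity is what you are looking for, here is a GNU awk one-liner that does just what you asked for: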

awk 'NR==FNR{a[$0];next} {for(s in a) if(!index($0,s)) exit 1}' strings RS='^$' file

And here is some other information and a few more options:

Assuming you are really looking for strings, it would be:

awk -v strings='string1 string2 string3' '
BEGIN {
    numStrings = split(strings,tmp)
    for (i in tmp) strs[tmp[i]]
}
numStrings == 0 { exit }
{
    for (str in strs) {
        if ( index($0,str) ) {
            delete strs[str]
            numStrings--
        }
    }
}
END { exit (numStrings ? 1 : 0) }
' file

The above will stop reading the file as soon as all strings have matched.
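For readers who find the control flow easier to follow outside awk, here is a minimal Python sketch of the same one-pass idea, assuming plain substring matching. The name contains_all is made up for this illustration and does not come from the answer.

```python
def contains_all(lines, strings):
    """Return True as soon as every string has been seen in some line."""
    remaining = set(strings)
    for line in lines:
        # Drop every string that occurs in this line (plain substring test).
        remaining = {s for s in remaining if s not in line}
        if not remaining:
            return True  # everything matched: stop reading the input here
    return not remaining


if __name__ == "__main__":
    sample = ["...", "string1", "...", "string2 string3", "never read"]
    print(contains_all(iter(sample), ["string1", "string2", "string3"]))
```

Because lines can be any iterable, an open file object works too, and only one line is held in memory at a time.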

If you were looking for regexps instead of strings, then with GNU awk (for multi-char RS and retention of $0 in the END section) you could do:

awk -v RS='^$' 'END{exit !(/regexp1/ && /regexp2/ && /regexp3/)}' file

Actually, even if it was strings you could do:

awk -v RS='^$' 'END{exit !(index($0,"string1") && index($0,"string2") && index($0,"string3"))}' file

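The main issue with the above two GNU awk solutions is that, like @anubhava's GNU grep -P solution, the whole file has to be read into memory at one time, whereas the first awk script above will work in any awk in any shell on any UNIX box and only stores one line of input at a time.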
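To make the whole-file variant concrete, here is a small Python sketch under the assumption that the file fits in memory. The function name text_matches_all is hypothetical. Like the RS='^$' scripts, it applies every regexp to the entire contents at once, so ^ and $ anchor to the start and end of the whole text rather than to individual lines.

```python
import re


def text_matches_all(text, regexps):
    """True if every regexp matches somewhere in the whole text."""
    # all() stops at the first regexp that fails to match.
    return all(re.search(rx, text) for rx in regexps)


if __name__ == "__main__":
    text = "...\nstring1\n...\nstring2 string3\n"
    print(text_matches_all(text, [r"string1", r"string2\s+string3"]))
```

In practice text would come from something like open(path).read(), which is exactly the memory cost discussed above.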

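I see you have added a comment under your question saying that you could have several thousand "patterns". Assuming you mean "strings", then instead of passing them as arguments to the script you could read them from a file, e.g. with GNU awk (for multi-char RS) and a file with one search string per line: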

awk '
NR==FNR { strings[$0]; next }
{
    for (string in strings)
        if ( !index($0,string) )
            exit 1
}
' file_of_strings RS='^$' file_to_be_searched

and for regexps it would be:

awk '
NR==FNR { regexps[$0]; next }
{
    for (regexp in regexps)
        if ( $0 !~ regexp )
            exit 1
}
' file_of_regexps RS='^$' file_to_be_searched

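If you don't have GNU awk and your input file does not contain NUL characters, then you can get the same effect as above by using RS='\0' instead of RS='^$', or by appending each line to a variable as it is read and then processing that variable in the END section.

If your file_to_be_searched is too large to fit in memory, then it would be this for strings: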

awk '
NR==FNR { strings[$0]; numStrings=NR; next }
numStrings == 0 { exit }
{
    for (string in strings) {
        if ( index($0,string) ) {
            delete strings[string]
            numStrings--
        }
    }
}
END { exit (numStrings ? 1 : 0) }
' file_of_strings file_to_be_searched

and the equivalent for regexps:

awk '
NR==FNR { regexps[$0]; numRegexps=NR; next }
numRegexps == 0 { exit }
{
    for (regexp in regexps) {
        if ( $0 ~ regexp ) {
            delete regexps[regexp]
            numRegexps--
        }
    }
}
END { exit (numRegexps ? 1 : 0) }
' file_of_regexps file_to_be_searched

Answer 2

`git grep`

Here is the syntax for using git grep with multiple patterns:

git grep --all-match --no-index -l -e string1 -e string2 -e string3 file

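You may also combine patterns with Boolean expressions such as --and, --or and --not.

Check man git-grep for help.

--all-match When giving multiple pattern expressions, this flag is specified to limit the match to files that have lines to match all of them.

--no-index Search files in the current directory that are not managed by Git.

-l / --files-with-matches / --name-only Show only the names of files.

-e The next parameter is the pattern. The default is to use basic regexp.

Other parameters to consider:

--threads Number of grep worker threads to use.

-q / --quiet / --silent Do not output matched lines; exit with status 0 when there is a match.

To change the pattern type, you may also use -G / --basic-regexp (default), -F / --fixed-strings, -E / --extended-regexp, -P / --perl-regexp, and others.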

Answer 3

This gnu-awk script may work:

cat fileSearch.awk
re == "" {
   exit
}
{
   split($0, null, "\\<(" re ")\\>", b)
   for (i=1; i<=length(b); i++)
      gsub("\\<" b[i] "([|]|$)", "", re)
}
END {
   exit (re != "")
}

Then use it as:

if awk -v re='string1|string2|string3' -f fileSearch.awk file; then
   echo "all strings were found"
else
   echo "all strings were not found"
fi

Alternatively, you can use this gnu grep solution with the PCRE option:

grep -qzP '(?s)(?=.*\bstring1\b)(?=.*\bstring2\b)(?=.*\bstring3\b)' file

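With -z, grep reads the complete file as a single string (as long as it contains no NUL bytes).
Multiple lookahead assertions assert that all the strings are present in the file.
The regex must use (?s), the DOTALL modifier, to make .* match across lines.

As per man grep: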

-z, --null-data
   Treat  input  and  output  data as sequences of lines, each terminated by a 
   zero byte (the ASCII NUL character) instead of a newline.
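The lookahead trick is not specific to grep. As a rough illustration, the same pattern can be built in Python, where re.DOTALL plays the role of (?s). The helper name build_all_of_regex is an assumption made for this sketch, and re.escape is used so the words are treated as literal strings.

```python
import re


def build_all_of_regex(words):
    """Compile one regex that requires every word to occur somewhere."""
    # One lookahead per word; re.DOTALL lets ".*" run across newlines.
    lookaheads = "".join(r"(?=.*\b{}\b)".format(re.escape(w)) for w in words)
    return re.compile(lookaheads, re.DOTALL)


pattern = build_all_of_regex(["string1", "string2", "string3"])

if __name__ == "__main__":
    text = "...\nstring3\n...\nstring1 string2\n"
    # match() tries the lookaheads at the start of the text only.
    print(pattern.match(text) is not None)
```

As with grep -z, the whole text has to be in memory, and the \b anchors mean that string1x does not count as string1.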

Answer 4

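First of all, you probably want to use awk. Since you eliminated that option in the question statement: yes, it is possible to do, and this provides a way to do it. It is likely MUCH slower than using awk, but if you want to do it anyway...

This is based on the following assumptions:

Invoking AWK is unacceptable
Invoking grep multiple times is unacceptable
The use of any other external tools is unacceptable
Invoking grep less than once is acceptable
It must return success if everything is found, failure when not
Using bash instead of external tools is acceptable
The regular expression version needs a bash that supports the =~ operator (bash 3.0 or later)

This might meet all of your requirements: (the regex version is missing some comments; look at the string version instead)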

#!/bin/bash

multimatch() {
    filename="$1" # Filename is first parameter
    shift # move it out of the way that "$@" is useful
    strings=( "$@" ) # search strings into an array

    declare -a matches # Array to keep track which strings already match

    # Initiate array tracking what we have matches for
    for ((i=0;i<${#strings[@]};i++)); do
        matches[$i]=0
    done

    while IFS= read -r line; do # Read file linewise
        foundmatch=0 # Flag to indicate whether this line matched anything
        for ((i=0;i<${#strings[@]};i++)); do # Loop through strings indexes
            if [ "${matches[$i]}" -eq 0 ]; then # If no previous line matched this string yet
                string="${strings[$i]}" # fetch the string
                if [[ $line = *"$string"* ]]; then # check if it matches (quoted so the string is taken literally, not as a glob)
                    matches[$i]=1   # mark that we have found this
                    foundmatch=1    # set the flag, we need to check whether we have something left
                fi
            fi
        done
        # If we found something, we need to check whether we
        # can stop looking
        if [ "$foundmatch" -eq 1 ]; then
            somethingleft=0 # Flag to see if we still have unmatched strings
            for ((i=0;i<${#matches[@]};i++)); do
                if [ "${matches[$i]}" -eq 0 ]; then
                    somethingleft=1 # Something is still outstanding
                    break # no need check whether more strings are outstanding
                fi
            done
            # If we didn't find anything unmatched, we have everything
            if [ "$somethingleft" -eq 0 ]; then return 0; fi
        fi
    done < "$filename"

    # If we get here, we didn't have everything in the file
    return 1
}

multimatch_regex() {
    filename="$1" # Filename is first parameter
    shift # move it out of the way that "$@" is useful
    regexes=( "$@" ) # Regexes into an array

    declare -a matches # Array to keep track which regexes already match

    # Initiate array tracking what we have matches for
    for ((i=0;i<${#regexes[@]};i++)); do
        matches[$i]=0
    done

    while IFS= read -r line; do # Read file linewise
        foundmatch=0 # Flag to indicate whether this line matched anything
        for ((i=0;i<${#regexes[@]};i++)); do # Loop through regex indexes
            if [ "${matches[$i]}" -eq 0 ]; then # If no previous line matched this string yet
                regex="${regexes[$i]}" # Get regex from array
                if [[ $line =~ $regex ]]; then # We use the bash regex operator here
                    matches[$i]=1   # mark that we have found this
                    foundmatch=1    # set the flag, we need to check whether we have something left
                fi
            fi
        done
        # If we found something, we need to check whether we
        # can stop looking
        if [ "$foundmatch" -eq 1 ]; then
            somethingleft=0 # Flag to see if we still have unmatched strings
            for ((i=0;i<${#matches[@]};i++)); do
                if [ "${matches[$i]}" -eq 0 ]; then
                    somethingleft=1 # Something is still outstanding
                    break # no need check whether more strings are outstanding
                fi
            done
            # If we didn't find anything unmatched, we have everything
            if [ "$somethingleft" -eq 0 ]; then return 0; fi
        fi
    done < "$filename"

    # If we get here, we didn't have everything in the file
    return 1
}

if multimatch "filename" string1 string2 string3; then
    echo "file has all strings"
else
    echo "file miss one or more strings"
fi

if multimatch_regex "filename" "regex1" "regex2" "regex3"; then
    echo "file match all regular expressions"
else
    echo "file does not match all regular expressions"
fi

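Benchmarks

I did some benchmarking, searching the .c, .h and .sh files in arch/arm/ from Linux 4.16.2 for the strings "void", "function" and "#define". (Shell wrappers were added and the code was tuned so that all of them can be called as testname <filename> <searchstring> [...] and so that an if can be used to check the result)

Results: (measured with time, real time rounded to the closest half second)

multimatch: 49s
multimatch_regex: 55s
matchall: 10.5s
fileMatchesAllNames: 4s
awk (first version): 4s
agrep: 4.5s
Perl re (-r): 10.5s
Perl non-re: 9.5s
Perl non-re optimised: 5s (removed Getopt::Std and regex support for faster startup)
Perl re optimised: 7s (removed Getopt::Std and non-regex support for faster startup)
git grep: 3.5s
C version (no regex): 1.5s

(Invoking grep multiple times, especially with the recursive method, did better than I expected)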

Answer 5

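A recursive solution. Iterate over the files one by one. For each file, check whether it matches the first pattern and break early (-m1: on the first match); only if it matched the first pattern, search for the second pattern, and so on: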

#!/bin/bash

patterns=("$@")   # keep each pattern as a separate array element

fileMatchesAllNames () {
  file=$1
  if [[ $# -eq 1 ]]
  then
    echo "$file"
  else
    shift
    pattern=$1
    shift
    # quote "$@" so that patterns containing spaces survive the recursion
    grep -m1 -q "$pattern" "$file" && fileMatchesAllNames "$file" "$@"
  fi
}

for file in *
do
  test -f "$file" && fileMatchesAllNames "$file" "${patterns[@]}"
done

Usage:

./allfilter.sh cat filter java
test.sh

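This searches the current directory for the tokens "cat", "filter" and "java". It found them only in "test.sh".

So grep is invoked often in the worst-case scenario (finding the first N-1 patterns in the last line of each file, but not the N-th pattern).

But with an informed ordering if possible (rare matches first, early matches first), the solution should be reasonably fast, since many files are abandoned early because they did not match the first keyword, or are accepted early because they matched a keyword close to the top.

Example: you search Scala source files that contain tailrec (somewhat rarely used), mutable (rarely used, but if so, close to the top in the import statements), main (rarely used, often not close to the top) and println (often used, unpredictable position). You would order them: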

./allfilter.sh mutable tailrec main println
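The effect of the ordering can be seen in a short Python sketch of the same short-circuit idea. This is only an illustration, not the script above: the names has_match and file_matches_all are invented here, and each pattern still causes its own scan of the file, just as each grep call does.

```python
import re


def has_match(path, pattern):
    """Like `grep -m1 -q`: stop reading at the first matching line."""
    rx = re.compile(pattern)
    with open(path, encoding="utf-8", errors="replace") as f:
        return any(rx.search(line) for line in f)


def file_matches_all(path, patterns):
    # all() short-circuits: the file is abandoned at the first pattern that
    # fails, so putting the rarest pattern first rejects most files quickly.
    return all(has_match(path, p) for p in patterns)
```

Calling file_matches_all(path, ["mutable", "tailrec", "main", "println"]) would test the patterns in the order suggested above.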

Performance:

ls *.scala | wc 
 89      89    2030

In 89 scala files, I have this keyword distribution:

for keyword in mutable tailrec main println; do grep -m 1 $keyword *.scala | wc -l ; done 
16
34
41
71

Searching them with a slightly modified version of the script, which allows a filepattern as the first argument, takes about 0.2s:

time ./allfilter.sh "*.scala" mutable tailrec main println
Filepattern: *.scala    Patterns: mutable tailrec main println
aoc21-2017-12-22_00:16:21.scala
aoc25.scala
CondenseString.scala
Partition.scala
StringCondense.scala

real    0m0.216s
user    0m0.024s
sys 0m0.028s

In close to 15,000 lines of code:

cat *.scala | wc 
  14913   81614  610893

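Update

After reading in the comments to the question that we might be talking about thousands of patterns, handing them over as arguments does not seem to be a clever idea; better to read them from a file and pass the filename as an argument, and maybe do the same for the list of files to filter: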

#!/bin/bash

filelist="$1"
patternfile="$2"
patterns="$(< $patternfile)"

fileMatchesAllNames () {
  file=$1
  if [[ $# -eq 1 ]]
  then
    echo "$file"
  else
    shift
    pattern=$1
    shift
    grep -m1 -q "$pattern" "$file" && fileMatchesAllNames "$file" $@
  fi
}

echo -e "Filelist: $filelist\tPatterns: $patterns" >&2   # diagnostics go to stderr so they stay out of the result list
for file in $(< $filelist)
do
  test -f "$file" && fileMatchesAllNames "$file" $patterns
done

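If the number and length of the patterns/files exceed what can be passed as arguments, the list of patterns could be split into many pattern files and processed in a loop (for example, with 20 pattern files):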

for i in {1..20}
do
   ./allfilter2.sh file.$i.lst pattern.$i.lst > file.$((i+1)).lst
done

Answer 6

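You can

make use of the -o|--only-matching option of grep (which forces it to output only the matched parts of a matching line, with each such part on a separate output line),
then eliminate duplicate occurrences of matched strings with sort -u,
and finally check that the count of remaining lines equals the count of the input strings.

Demonstration:

$ cat input
...
string1
...
string2
...
string3
...
string1 string2
...
string1 string2 string3
...
string3 string1 string2
...
string2 string3
... and so on

$ grep -o -F $'string1\nstring2\nstring3' input|sort -u|wc -l
3

$ grep -o -F $'string1\nstring3' input|sort -u|wc -l
2

$ grep -o -F $'string1\nstring2\nfoo' input|sort -u|wc -l
2

One shortcoming of this solution (it fails to meet the "partial matches should be OK" requirement) is that grep does not detect overlapping matches. For example, although the text abcd matches both abc and bcd, grep finds only one of them:

$ grep -o -F $'abc\nbcd' <<< abcd
abc

$ grep -o -F $'bcd\nabc' <<< abcd
abc

Note that this approach/solution works only for fixed strings. It cannot be extended to regexes, because a single regex can match multiple different strings and we cannot track which match corresponds to which regex. The best you can do is store the matches in a temporary file and then run grep multiple times, using one regex at a time.

The solution can also be implemented as a bash script (matchall).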
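The overlapping-match limitation described in this answer can be reproduced in a few lines of Python. This is only an approximation of grep -o: Python's re picks the first alternative that matches at a position rather than the longest one, and the function names below are invented for the illustration.

```python
import re


def found_strings(text, strings):
    """Roughly mimic `grep -o -F ... | sort -u`: unique non-overlapping hits."""
    alternation = "|".join(re.escape(s) for s in strings)
    return set(re.findall(alternation, text))


def found_strings_overlapping(text, strings):
    """Test each string on its own, so overlapping occurrences all count."""
    return {s for s in strings if s in text}


if __name__ == "__main__":
    print(found_strings("abcd", ["abc", "bcd"]))
    print(found_strings_overlapping("abcd", ["abc", "bcd"]))
```

The first function reports only abc for the text abcd, while the second reports both abc and bcd, which is what the "partial matches should be OK" requirement asks for.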

Answer 7

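The easiest way for me to check whether the file has all three patterns is to get only the matched patterns, output only the unique parts, and count the lines. Then you can check the result with a simple test condition: test 3 -eq $grep_lines.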

 grep_lines=$(grep -Eo 'string1|string2|string3' file | sort -u | wc -l)

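Regarding your second question, I don't think it is possible to stop reading the file as soon as all of the patterns have been found. I have read the man page for grep and there are no options that could help you with that. You can only stop reading after a specific number of matching lines with the option grep -m [number], which happens no matter which patterns matched.

I am pretty sure that a custom function is needed for that purpose.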

Answer 8

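This is an interesting problem, and there is nothing obvious in the grep man page to suggest an easy answer. There might be an insane regex that would do it, but it may be clearer with a straightforward chain of greps, even though that ends up scanning the file n times. At least the -q option makes grep bail out at the first match each time, and the && will short-circuit evaluation if one of the strings is not found.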

$ grep -Fq string1 t && grep -Fq string2 t && grep -Fq string3 t
$ echo $?
0

$ grep -Fq string1 t && grep -Fq blah t && grep -Fq string3 t
$ echo $?
1

Answer 9

perl -lne '%m = (%m, map {$_ => 1} m!\b(string1|string2|string3)\b!g); END { print scalar keys %m == 3 ? "Match": "No Match"}' file

Answer 10

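Ignoring the "Is it possible to do it without ... or use a tool like awk or python?" requirement, you could do it with a Perl script:

(Use an appropriate shebang for your system, or something like /bin/env perl)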

#!/usr/bin/perl

use Getopt::Std; # option parsing

my %opts;
my $filename;
my @patterns;
getopts('rf:',\%opts); # Allowing -f <filename> and -r to enable regex processing

if ($opts{'f'}) { # if -f is given
    $filename = $opts{'f'};
    @patterns = @ARGV[0 .. $#ARGV]; # Use everything else as patterns
} else { # Otherwise
    $filename = $ARGV[0]; # First parameter is filename
    @patterns = @ARGV[1 .. $#ARGV]; # Rest is patterns
}
my $use_re= $opts{'r'}; # Flag on whether patterns are regex or not

open(INF,'<',$filename) or die("Can't open input file '$filename'");


while (my $line = <INF>) {
    my @removal_list = (); # List of stuff that matched that we don't want to check again
    for (my $i=0;$i <= $#patterns;$i++) {
        my $pattern = $patterns[$i];
        if (($use_re&& $line =~ /$pattern/) || # regex match
            (!$use_re&& index($line,$pattern) >= 0)) { # or string search
            push(@removal_list,$i); # Mark to be removed
        }
    }
    # Now remove everything we found this time
    # We need to work backwards to keep us from messing
    # with the list while we're busy
    for (my $i=$#removal_list;$i >= 0;$i--) {
        splice(@patterns,$removal_list[$i],1);
    }
    if (scalar(@patterns) == 0) { # If we don't need to match anything anymore
        close(INF) or warn("Error closing '$filename'");
        exit(0); # We found everything
    }
}
# End of file

close(INF) or die("Error closing '$filename'");
exit(1); # If we reach this, we haven't matched everything

Saved as matcher.pl, this will search for plain text strings:

./matcher filename string1 string2 string3 'complex string'

This will search for regular expressions:

./matcher -r filename regex1 'regex2' 'regex4'

(The filename can be specified with -f instead):

./matcher -f filename -r string1 string2 string3 'complex string'

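It is limited to single-line matching patterns (because it processes the file line by line).

The performance, when calling it for lots of files from a shell script, is slower than awk (but search patterns can contain spaces, unlike the ones passed space-separated in -v to awk). If converted to a function and called from Perl code (with a file containing a list of files to search), it should be much faster than most awk implementations. (When called on several smallish files, the perl startup time, i.e. parsing of the script and so on, dominates the timing)

It can be sped up significantly by hardcoding whether regular expressions are used or not, at the cost of flexibility. (See my benchmarks for the effect of removing Getopt::Std)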

Answer 11

Maybe with gnu sed

cat match_word.sh

sed -z '
  /\b'"$2"'/!bA
  /\b'"$3"'/!bA
  /\b'"$4"'/!bA
  /\b'"$5"'/!bA
  s/.*/0\n/
  q
  :A
  s/.*/1\n/
' "$1"

and you call it like this:

./match_word.sh infile string1 string2 string3

It prints 0 if all matches are found, and 1 otherwise.

Here you can look for 4 strings.

If you want more, you can add lines like

/\b'"$x"'/!bA

Answer 12

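Just for "solutions completeness", you can use a different tool and avoid multiple greps and awk/sed or big (and probably slow) shell loops; such a tool is agrep.

agrep is actually a kind of egrep that also supports an and operation between patterns, using ; as the pattern separator.

Like egrep and like most of the well-known tools, agrep operates on records/lines, so we still need a way to treat the whole file as a single record.
Moreover, agrep provides a -d option to set your own record delimiter.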

Some tests:

$ cat file6
str4
str1
str2
str3
str1 str2
str1 str2 str3
str3 str1 str2
str2 str3

$ agrep -d '$$\n' 'str3;str2;str1;str4' file6;echo $?
str4
str1
str2
str3
str1 str2
str1 str2 str3
str3 str1 str2
str2 str3
0

$ agrep -d '$$\n' 'str3;str2;str1;str4;str5' file6;echo $?
1

$ agrep -p 'str3;str2;str1' file6  #-p prints lines containing all three patterns in any position
str1 str2 str3
str3 str1 str2

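No tool is perfect, and agrep also has some limitations; you can't use a regex/pattern longer than 32 characters, and some options are not available when used with regexps. All of these are explained in the agrep man page.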

Answer 13

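Assuming all the strings you want to check are in a file strings.txt, and the file you want to check is input.txt, the following one-liner will do:

Updated the answer based on comments: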

$ diff <( sort -u strings.txt )  <( grep -o -f strings.txt input.txt | sort -u )

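Explanation:

Use grep's -o option to match only the strings you are interested in. This gives all the strings that are present in the file input.txt. Then use diff to get the strings that were not found. If all the strings were found, the result is empty. Or, just check the exit code of diff.

What it does not do:

Exit as soon as all matches are found.
Extend to regexes.
Handle overlapping matches.

What it does do:

Find all matches.
Make a single call to grep.
Avoid using awk or python.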
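For comparison only, the "which strings were not found" result that diff produces here can be expressed as a set difference. The following Python sketch is not part of this answer (which deliberately avoids python), and the name missing_strings is hypothetical. It tests each string separately, so overlapping matches are not a problem.

```python
def missing_strings(lines, strings):
    """Return the sorted list of strings that occur in none of the lines."""
    wanted = set(strings)
    found = set()
    for line in lines:
        # Record every wanted string that appears in this line.
        found.update(s for s in wanted if s in line)
    return sorted(wanted - found)


if __name__ == "__main__":
    print(missing_strings(["string1 x", "string3"],
                          ["string1", "string2", "string3"]))
```

An empty list corresponds to the empty diff output described above.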

Answer 14

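Using the fileinput module in python allows the files to be specified on the command line, or the text to be read line by line from stdin. You could hard code the strings into a python list.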

# Strings to match, must be valid regular expression patterns
# or be escaped when compiled into regex below.
strings = (
    r'string1',
    r'string2',
    r'string3',
)

or read the strings from another file

import re
from fileinput import input, filename, nextfile, isfirstline

for line in input():
    if isfirstline():
        regexs = [re.compile(s) for s in strings]  # new file, reload all strings

    # keep only strings that have not been seen in this file
    # (search, not match, so a string may occur anywhere in the line)
    regexs = [rx for rx in regexs if not rx.search(line)]

    if not regexs:  # found all strings
        print(filename())
        nextfile()

Answer 15

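Many of these answers are fine as far as they go.

But if performance is an issue (certainly possible if the input is large and you have many thousands of patterns), then you will get a large speedup from a tool like lex or flex, which generates a true deterministic finite automaton as a recognizer rather than calling a regex interpreter once per pattern.

The finite automaton will execute a few machine instructions per input character regardless of the number of patterns.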
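To give a feel for the "one automaton for all patterns" idea without a build step, here is a sketch in Python. Note the swap: this is not what flex generates. It uses the Aho-Corasick construction, which handles fixed strings only (no regexes), and the function names are invented for this illustration. The input is consumed one character at a time and scanning stops once every pattern has been seen.

```python
from collections import deque


def build_automaton(patterns):
    """Build Aho-Corasick tables: goto transitions, failure links, outputs."""
    goto = [{}]     # goto[state][char] -> next state
    fail = [0]      # failure link of each state
    out = [set()]   # indices of the patterns recognized at each state
    for idx, pat in enumerate(patterns):
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(idx)
    # Breadth-first pass to fill in the failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def contains_all(chars, patterns):
    """True once every pattern has occurred in the character stream."""
    goto, fail, out = build_automaton(patterns)
    missing = set(range(len(patterns))) - out[0]  # empty patterns always match
    state = 0
    for ch in chars:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        missing -= out[state]
        if not missing:
            return True  # early exit: all patterns seen
    return not missing
```

A call such as contains_all(open("input.txt").read(), ["abc", "ABC"]) would check a whole file; the file name here is just a placeholder.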

A no-frills flex solution:

%{
void match(int);
%}
%option noyywrap

%%

"abc"       match(0);
"ABC"       match(1);
[0-9]+      match(2);
    /* Continue adding regex and exact string patterns... */

[ \t\n]     /* Do nothing with whitespace. */
.   /* Do nothing with unknown characters. */

%%

// Total number of patterns.
#define N_PATTERNS 3

int n_matches = 0;
int counts[10000];

void match(int n) {
  if (counts[n]++ == 0 && ++n_matches == N_PATTERNS) {
    printf("All matched!\n");
    exit(0);
  }
}

int main(void) {
  yyin = stdin;
  yylex();
  printf("Only matched %d patterns.\n", n_matches);
  return 1;
}

A downside is that you have to build this for every given set of patterns. That's not too bad:

flex matcher.y
gcc -O lex.yy.c -o matcher

Now run it:

./matcher < input.txt

Answer 16

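For plain speed, with no external tool limitations and no regexes, this (crude) C version does a decent job. (Possibly Linux only, although it should work on all Unix-like systems with mmap)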

#include <sys/mman.h>
#include <sys/stat.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/* https://stackoverflow.com/a/8584708/1837991 */
static inline char *sstrstr(char *haystack, char *needle, size_t length)
{
    size_t needle_length = strlen(needle);
    size_t i;
    for (i = 0; i < length; i++) {
        if (i + needle_length > length) {
            return NULL;
        }
        if (strncmp(&haystack[i], needle, needle_length) == 0) {
            return &haystack[i];
        }
    }
    return NULL;
}

int matcher(char * filename, char ** strings, unsigned int str_count)
{
    int fd;
    struct stat sb;
    char *addr;
    unsigned int i = 0; /* Used to keep us from running of the end of strings into SIGSEGV */

    fd = open(filename, O_RDONLY);
    if (fd == -1) {
        fprintf(stderr,"Error '%s' with open on '%s'\n",strerror(errno),filename);
        return 2;
    }

    if (fstat(fd, &sb) == -1) {          /* To obtain file size */
        fprintf(stderr,"Error '%s' with fstat on '%s'\n",strerror(errno),filename);
        close(fd);
        return 2;
    }

    if (sb.st_size <= 0) { /* zero byte file */
        close(fd);
        return 1; /* 0 byte files don't match anything */
    }

    /* mmap the file. */
    addr = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) {
        fprintf(stderr,"Error '%s' with mmap on '%s'\n",strerror(errno),filename);
        close(fd);
        return 2;
    }

    while (i++ < str_count) {
        char * found = sstrstr(addr,strings[0],sb.st_size);
        if (found == NULL) {  /* If we haven't found this string, we can't find all of them */
            munmap(addr, sb.st_size);
            close(fd);
            return 1; /* so give the user an error */
        }
        strings++;
    }
    munmap(addr, sb.st_size);
    close(fd);
    return 0; /* if we get here, we found everything */
}

int main(int argc, char *argv[])
{
    char *filename;
    char **strings;
    unsigned int str_count;
    if (argc < 3) { /* Lets count parameters at least... */
        fprintf(stderr,"%i is not enough parameters!\n",argc);
        return 2;
    }
    filename = argv[1]; /* First parameter is filename */
    strings = argv + 2; /* Search strings start from 3rd parameter */
    str_count = argc - 2; /* strings are two ($0 and filename) less than argc */

    return matcher(filename,strings,str_count);
}

Compile it with:

gcc matcher.c -o matcher

Run it with:

./matcher filename needle1 needle2 needle3

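Credits:

Uses sstrstr
File handling mostly taken from the mmap man page

Notes:

It will scan through the parts of the file preceding the matching strings multiple times; it will only open the file once, though.
The entire file could end up loaded into memory, especially if a string does not match; that is for the OS to decide
Regex support can probably be added by using the POSIX regex library (performance would likely be slightly better than grep: it should be based on the same library, and you would get reduced overhead from only opening the file once to search for multiple regexes)
Files containing nulls should work, but not search strings containing them...
All characters other than null should be searchable (\r, \n, etc)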
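The same mmap-based approach can be sketched in Python, whose standard mmap module exposes a find method on the mapping. This is an illustrative equivalent rather than a port of the C code; the name file_contains_all is made up, and needles are expected as byte strings.

```python
import mmap


def file_contains_all(path, needles):
    """True if every byte string in needles occurs somewhere in the file."""
    with open(path, "rb") as f:
        try:
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        except ValueError:
            return False  # an empty file cannot be mapped and matches nothing
        with mm:
            # find() scans the mapping from the start for each needle, and
            # all() gives up at the first needle that is missing.
            return all(mm.find(n) != -1 for n in needles)
```

As in the C version, the file is opened once and the OS decides how much of it is actually paged into memory.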

Answer 17

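The following python script should do the trick. It does, in a way, call the equivalent of grep (re.search) multiple times for each line, i.e. it searches for each pattern in each line, but since you are not forking a process each time, it should be much more efficient. Also, it removes the patterns that have already been found and stops when all of them have been found.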

#!/usr/bin/env python

import re

# the file to search
filename = '/path/to/your/file.txt'

# list of patterns -- can be read from a file or command line 
# depending on the count
patterns = [r'py.*$', r'\s+open\s+', r'^import\s+']
patterns = [re.compile(p) for p in patterns]

with open(filename) as f:
    for line in f:
        # keep only the patterns that did not match this line
        patterns = [p for p in patterns if p.search(line) is None]

        # stop if no more patterns are left
        if not patterns:
            break

# print the patterns which were not found
for p in patterns:
    print(p.pattern)

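If you are dealing with plain (non-regex) strings, you can add a separate check for them (string in line), which will be slightly more efficient.

Does that solve your problem?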

Answer 18

Another Perl variant: as soon as all the given strings have matched, even if the file has only been half read, processing stops and the result is printed

> perl -lne ' /\b(string1|string2|string3)\b/ and $m{$1}++; exit if keys %m == 3; END { print keys %m == 3 ? "Match": "No Match"}'  all_match.txt
Match
> perl -lne ' /\b(string1|string2|stringx)\b/ and $m{$1}++; exit if keys %m == 3; END { print keys %m == 3 ? "Match": "No Match"}'  all_match.txt
No Match

Answer 19

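Remove the line separators first, then use normal grep multiple times, once per pattern, as below.

Example: let the file content be as below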

PAT1
PAT2
PAT3
something
somethingelse

cat file | tr -d "\n" | grep "PAT1" | grep "PAT2" | grep -c "PAT3"

Answer 20

I did not see a simple counter among the answers, so here is a counter-oriented solution using awk that stops as soon as all matches are satisfied:

/string1/ { a = 1 }
/string2/ { b = 1 }
/string3/ { c = 1 }
{
    if (c + a + b == 3) {
        print "Found!";
        exit;
    }
}

A generic script

To extend the usage through shell arguments:

#! /bin/sh
awk -v vars="$*" -v argc=$# '
BEGIN { argc = split(vars, args); }  # count the patterns actually split out of vars
{
    for (arg in args) {
        if (!temp[arg] && $0 ~ args[arg]) {
            inc++;
            temp[arg] = 1;
        }
    }

    if (inc == argc) {
        print "Found!";
        found = 1;   # remember success: END still runs after exit
        exit;
    }
}
END { exit !found; }
' filename

Usage (you can pass regular expressions):

./script "str1?" "(wo)?men" str3

or to apply a string of patterns:

./script "str1? (wo)?men str3"

Answer 21

$ cat allstringsfile | tr '\n' ' ' |  awk -f awkpattern1

where allstringsfile is your text file, as in the original question. awkpattern1 contains the string patterns, combined with the && condition:

$ cat awkpattern1
/string1/ && /string2/ && /string3/

Check if all of multiple strings or regexes exist in a file
