Question

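I want to grep using patterns read from a file that contains regular expressions. When a pattern matches, grep prints the matching string but not the pattern. How can I get the pattern instead of the matching string?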

pattern.txt

Apple (Ball|chocolate|fall) Donut
donut (apple|ball) Chocolate
Donut Gorilla Chocolate
Chocolate (English|Fall) apple gorilla
gorilla chocolate (apple|ball)
(ball|donut) apple

strings.txt

apple ball Donut
donut ball chocolate
donut Ball Chocolate
apple donut
chocolate ball Apple

This is the grep command:

grep -Eix -f pattern.txt strings.txt

This command prints the matching lines from strings.txt:

apple ball Donut
donut ball chocolate
donut Ball Chocolate

But I want to find out which patterns from pattern.txt produced the matches:

Apple (Ball|chocolate|fall) Donut
donut (apple|ball) Chocolate

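pattern.txt may contain lowercase and uppercase text, lines with and without regular expressions, and any number of words and regex elements per line. There are no regex constructs other than parentheses and pipes.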

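I don't want to use a loop that runs grep once per pattern line, because that is slow. Is there a way to make grep print which pattern, or which line number of the pattern file, matched? Or is there a command other than grep that can do this job without being too slow?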

Answer 1

I don't know of a way to do it with grep, but with GNU awk:

$ awk '
BEGIN { IGNORECASE = 1 }      # for case insensitivity
NR==FNR {                     # process pattern file
    a[$0]                     # hash the entries to a
    next                      # process next line
}
{                             # process strings file
    for(i in a)               # loop all pattern file entries
        if($0 ~ "^" i "$") {  # if there is a match (see comments)
            print i           # output the matching pattern file entry
            # delete a[i]     # uncomment to delete matched patterns from a
            # next            # uncomment to end searching after first match
        }
}' pattern strings

Output:

D (A|B) C

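For each line of the strings file, the script loops over every line of the pattern file, so it also finds cases where more than one pattern matches. With case-sensitive matching there is only one match; GNU awk's IGNORECASE, for example, can be used to deal with that.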

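Also, if you want each matching pattern file entry to be output only once, you can delete it from a after its first match: add delete a[i] after the print. This may also give you some performance benefit.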
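The question also asks for the line number of the matching pattern. The following is a minimal sketch of that idea, not the answer's own code. The function name matched_patterns is hypothetical. It uses tolower() instead of IGNORECASE so that it does not depend on GNU awk, which assumes the patterns contain only literal words, parentheses and pipes, as stated in the question. It also assumes the pattern file is not empty.

```shell
#!/usr/bin/env bash
# matched_patterns is a hypothetical helper name, not part of the original answer.
# Usage: matched_patterns PATTERN_FILE STRINGS_FILE
matched_patterns() {
    awk '
    NR == FNR {                            # first file: the patterns
        pat[FNR] = $0                      # original text, keyed by line number
        re[FNR]  = "^(" tolower($0) ")$"   # anchored, lowercased regex (like grep -ix)
        n = FNR
        next
    }
    {                                      # second file: the strings
        line = tolower($0)
        for (i = 1; i <= n; i++)
            if ((i in re) && line ~ re[i]) {
                hit[i] = 1                 # remember that pattern i matched
                delete re[i]               # never test this pattern again
            }
    }
    END {                                  # report in pattern-file order
        for (i = 1; i <= n; i++)
            if (i in hit) print i ": " pat[i]
    }' "$1" "$2"
}
```

Calling matched_patterns pattern.txt strings.txt prints each matching pattern once, prefixed with its line number in the pattern file. Wrapping the pattern in "^(" and ")$" keeps a top-level alternation such as a|b anchored on both sides.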

Answer 2

EDIT: Since the OP has changed the input files, I am adding a solution for the changed input files as well.

awk '
FNR==NR{
   a[toupper($1),toupper($NF)]
   b[toupper($2)]
   next
}
{
   flag=""                 # reset per line so a stale flag cannot carry over
   val=toupper($2)
   gsub(/\)|\(|\|/," ",val)
   num=split(val,array," ")
   for(i=1;i<=num;i++){
      if(array[i] in b){
        flag=1
        break
      }
   }
}
flag && ((toupper($1),toupper($NF)) in a){
  print;
  flag=""
}' string pattern

Output will be as follows.

Apple (Ball|chocolate|fall) Donut
donut (apple|ball) Chocolate

1st solution: Adding a generic solution. If the second field of your Input_file named pattern has more than two values, e.g. (B|C|D|E), then the following may help you.

awk '
FNR==NR{
  a[$1,$NF]
  b[toupper($2)]
  next
}
{
  flag=""
  val=$2
  gsub(/\)|\(|\|/," ",val)
  num=split(val,array," ")
  for(i=1;i<=num;i++){
    if(array[i] in b){
      flag=1
      break
    }
  }
}
flag && (($1,$NF) in a){
  print
  flag=""
}' string pattern

2nd solution: Could you please try the following. Note that it strictly assumes your Input_file has the same format as the samples shown (here I assume that the second field of the Input_file named pattern has only 2 values).

awk '
FNR==NR{
  a[$1,$NF]
  b[toupper($2)]
  next
}
{
  val=$2
  gsub(/\)|\(|\|/," ",val)
  split(val,array," ")
}
((array[1] in b) || (array[2] in b)) && (($1,$NF) in a)
' string pattern

Answer 3

You could try it with bash builtins:

$ cat foo.sh
#!/usr/bin/env bash

# case insensitive
shopt -s nocasematch

# associative array of patterns
declare -A patterns=()
while read -r p; do
    patterns["$p"]=1
done < pattern.txt

# read strings, test remaining patterns,
# if match print pattern and remove it from array    
while read -r s; do
    for p in "${!patterns[@]}"; do
        if [[ $s =~ ^$p$ ]]; then
            printf "%s\n" "$p"
            unset patterns["$p"]
        fi
    done
done < strings.txt
$ ./foo.sh
Apple (Ball|chocolate|fall) Donut
donut (apple|ball) Chocolate

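I am not sure about the performance, but since there are no child processes, it should be much faster than calling grep for each pattern.

Of course, if you have millions of patterns, storing them in an associative array could exhaust the available memory.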
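One detail of the script above: bash does not guarantee the order in which the keys of an associative array are returned, so when a string matches several patterns the print order can vary. The following is a sketch of a variant that keeps the patterns in an indexed array and therefore tests them in file order. The function name first_matches is hypothetical, and the body runs in a subshell so that nocasematch does not leak into the calling shell.

```shell
#!/usr/bin/env bash
# first_matches is a hypothetical helper name used only for this sketch.
# Usage: first_matches PATTERN_FILE STRINGS_FILE
first_matches() (
    shopt -s nocasematch            # case-insensitive =~, local to this subshell

    # indexed array: patterns keep their order from the pattern file
    patterns=()
    while IFS= read -r p; do
        patterns+=("$p")
    done < "$1"

    # test each string against the patterns that have not matched yet
    while IFS= read -r s; do
        for i in "${!patterns[@]}"; do
            p=${patterns[i]}
            re="^($p)$"             # anchor the whole pattern, like grep -x
            if [[ $s =~ $re ]]; then
                printf '%s\n' "$p"
                unset 'patterns[i]' # drop it so it is printed only once
            fi
        done
    done < "$2"
)
```

Running first_matches pattern.txt strings.txt prints each matching pattern once, in the order in which the patterns first match a string.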

Answer 4

Maybe switch paradigms?

while read pat
do grep -Eix "$pat" strings.txt >"$pat" &
done <patterns.txt

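This produces ugly file names, but gives a clear list for each group. If you prefer, you could clean up the file names first. Maybe something like this (assuming the patterns reduce easily to something unique...):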

while read pat
do grep -Eix "$pat" strings.txt >"${pat//[^A-Z]/}" &
done <patterns.txt

It should be fairly fast, and it is relatively simple to implement. Hope that helps.
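If the per-pattern output files are not needed, the same one-grep-per-pattern idea can print the pattern directly. This is only a sketch, and it runs the grep calls sequentially rather than in the background. The function name count_matches_per_pattern is hypothetical. It prints each pattern that matched at least one line, preceded by the number of matching lines and a tab.

```shell
#!/usr/bin/env bash
# count_matches_per_pattern is a hypothetical helper name for this sketch.
# Usage: count_matches_per_pattern PATTERN_FILE STRINGS_FILE
count_matches_per_pattern() {
    while IFS= read -r pat; do
        # -c prints the number of matching lines;
        # -e protects patterns that begin with a dash
        count=$(grep -Eixc -e "$pat" -- "$2" || true)
        if [ "${count:-0}" -gt 0 ]; then
            printf '%s\t%s\n' "$count" "$pat"
        fi
    done < "$1"
}
```

This still starts one grep process per pattern, which is the cost the question wants to avoid, so it mainly serves as a simple baseline for small pattern files.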

grep with patterns from a file: print the pattern instead of the matched string
