Question

How can I count the number of times a substring occurs in a string using Bash?

Example:

I want to know how many times this substring...

Bluetooth
         Soft blocked: no
         Hard blocked: no

...occurs in this string...

0: asus-wlan: Wireless LAN
         Soft blocked: no
         Hard blocked: no
1: asus-bluetooth: Bluetooth
         Soft blocked: no
         Hard blocked: no
2: phy0: Wireless LAN
         Soft blocked: no
         Hard blocked: no
113: hci0: Bluetooth
         Soft blocked: no
         Hard blocked: no

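NOTE I: I have tried several approaches with sed, grep, and awk... Nothing seems to work when the strings contain spaces and span multiple lines.

NOTE II: I am a Linux user, and I am looking for a solution that does not require installing applications/tools beyond those commonly found in Linux distributions.

IMPORTANT:

In addition to my question, something like the hypothetical example below would also be welcome. In this case we use two shell variables (Bash) instead of files.

Example: (based on @Ed Morton's contribution)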

STRING="0: asus-wlan: Wireless LAN
         Soft blocked: no
         Hard blocked: no
1: asus-bluetooth: Bluetooth
         Soft blocked: no
         Hard blocked: no
2: phy0: Wireless LAN
         Soft blocked: no
         Hard blocked: no
113: hci0: Bluetooth
         Soft blocked: no
         Hard blocked: no"

SUB_STRING="Bluetooth
         Soft blocked: no
         Hard blocked: no"

awk -v RS='\0' 'NR==FNR{str=$0; next} {print gsub(str,"")}' "$STRING" "$SUB_STRING"

Answer 1

Using GNU awk:

$ awk '
BEGIN { RS="[0-9]+:" }      # number followed by colon is the record separator
NR==1 {                     # read the substring to b
    b=$0
    next
}
$0~b { c++ }                # if b matches current record, increment counter
END { print c }             # print counter value
' substringfile stringfile
2

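This solution requires the amount of whitespace to match exactly, so your example would not work as-is, because the substring has less indentation than the string. Note also that, with the RS chosen here, matching something like phy0: is not possible; in that case something like RS="(^|\n)[0-9]+:" would probably work.

Another approach: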

$ awk '
BEGIN{ RS="^$" }                           # treat whole files as one record
NR==1 { b=$0; next }                       # buffer substringfile
{
    while(match($0,b)) {                   # count matches of b in stringfile
        $0=substr($0,RSTART+RLENGTH-1)
        c++
    }
}
END { print c }                            # output
' substringfile stringfile

Edit: Sure, drop the BEGIN section and use Bash's process substitution, like this:

$ awk '
NR==1 { 
    b=$0
    gsub(/^ +| +$/,"",b)                 # clean surrounding space from substring
    next 
}
{
    while(match($0,b)) {
        $0=substr($0,RSTART+RLENGTH-1)
        c++
    }
}
END { print c }
' <(echo $SUB_STRING) <(echo $STRING)    # feed it with process substitution
2

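Because the variables are unquoted, the echo in each process substitution flattens the data onto one line and collapses repeated whitespace: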

$ echo $SUB_STRING
Bluetooth Soft blocked: no Hard blocked: no

So the whitespace problem should be eased somewhat.
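To make the flattening effect concrete, here is a minimal sketch in plain Bash that compares the unquoted and quoted expansions. The variable names flat and exact are illustrative assumptions and do not appear in the answer.

```shell
#!/usr/bin/env bash
# SUB_STRING follows the question; flat and exact are illustrative names.
SUB_STRING="Bluetooth
         Soft blocked: no
         Hard blocked: no"

# Unquoted: the shell word-splits the value, so echo receives separate
# words and joins them with single spaces on one line.
flat=$(echo $SUB_STRING)

# Quoted: the value is passed as one argument, so the newlines and the
# indentation survive.
exact=$(echo "$SUB_STRING")

printf '%s\n' "$flat"            # one line, single spaces
printf '%s\n' "$exact" | wc -l   # line count of the quoted form
```

Note that unquoted expansion also performs filename globbing, so this trick is only safe when the strings contain no *, ? or [ characters.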

Edit 2: Based on @EdMorton's hawk-eyed observation in the comments:

$ awk '
NR==1 { 
    b=$0
    gsub(/^ +| +$/,"",b)                 # clean surrounding space from substring
    next 
}
{ print gsub(b,"") }
' <(echo $SUB_STRING) <(echo $STRING)    # feed it with process substitution
2

Answer 2

Update based on the comments below, if the whitespace is the same in both strings:

awk 'BEGIN{print gsub(ARGV[2],"",ARGV[1])}' "$STRING" "$SUB_STRING"

Or, if the whitespace differs, e.g. the STRING lines start with 9 spaces but the SUB_STRING lines start with 8:

$ awk 'BEGIN{gsub(/[[:space:]]+/,"[[:space:]]+",ARGV[2]); print gsub(ARGV[2],"",ARGV[1])}' "$STRING" "$SUB_STRING"
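The one-liner above can be wrapped in a small function for reuse. The following is only a sketch: the function name count_ws_insensitive and the sample data are assumptions, and it uses an explicit [ \t\n] list instead of [[:space:]] so that it does not depend on POSIX character-class support in the installed awk. The needle is still interpreted as a regular expression.

```shell
#!/usr/bin/env bash
# count_ws_insensitive is a hypothetical helper name, not from the answer.
# It counts matches of $2 in $1, letting any run of whitespace in the
# needle match any run of whitespace in the haystack.
count_ws_insensitive() {
    awk 'BEGIN {
        needle = ARGV[2]
        # Turn each whitespace run in the needle into a whitespace regex.
        gsub(/[ \t\n]+/, "[ \t\n]+", needle)
        # gsub returns the number of substitutions, i.e. the match count.
        print gsub(needle, "", ARGV[1])
    }' "$1" "$2"
}

# Sample data (assumed): 9 spaces of indentation in the haystack, 8 in the needle.
hay='1: asus-bluetooth: Bluetooth
         Soft blocked: no
2: phy0: Wireless LAN
         Soft blocked: no
113: hci0: Bluetooth
         Soft blocked: no'
needle='Bluetooth
        Soft blocked: no'

count_ws_insensitive "$hay" "$needle"
```

Because the program consists only of a BEGIN block, awk never tries to open the two arguments as files.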

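Original answer:

With GNU awk, if the whitespace matches between the file and the search string, and the search string contains no RE metacharacters, all you need is: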

awk -v RS='^$' 'NR==FNR{str=$0; next} {print gsub(str,"")}' str file

Or with any awk, provided your input also does not contain NUL characters:

awk -v RS='\0' 'NR==FNR{str=$0; next} {print gsub(str,"")}' str file

But for a complete solution with explanations, read on:

With any POSIX awk in any shell on any UNIX box:

$ cat str
Bluetooth
        Soft blocked: no
        Hard blocked: no

$ awk '
NR==FNR { str=(str=="" ? "" : str ORS) $0; next }
{ rec=(rec=="" ? "" : rec ORS) $0 }
END {
    gsub(/[^[:space:]]/,"[&]",str) # make sure each non-space char is treated as literal
    gsub(/[[:space:]]+/,"[[:space:]]+",str) # make sure space differences do not matter
    print gsub(str,"",rec)
}
' str file
2

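With a non-POSIX awk such as nawk, just use an explicit list (a space, \t and \n) instead of [:space:]. If your search string can contain backslashes, then we would need to add one more gsub() to handle them.

Alternatively, with GNU awk for multi-char RS: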

$ awk -v RS='^$' 'NR==FNR{gsub(/[^[:space:]]/,"[&]"); gsub(/[[:space:]]+/,"[[:space:]]+"); str=$0; next} {print gsub(str,"")}' str file
2

Or with any awk, if your input cannot contain NUL characters:

$ awk -v RS='\0' 'NR==FNR{gsub(/[^[:space:]]/,"[&]"); gsub(/[[:space:]]+/,"[[:space:]]+"); str=$0; next} {print gsub(str,"")}' str file
2

And so on...
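Regarding the remark above about backslashes and regex metacharacters: when the whitespace in both strings is identical, one way to sidestep escaping altogether is to count literal matches with index() rather than with a regex. This is a different technique from the gsub() approach in this answer, shown only as a sketch. The function name count_literal and the environment variable names HAY and NEEDLE are assumptions; the strings are passed through ENVIRON so that awk does not interpret backslash escapes in them.

```shell
#!/usr/bin/env bash
# count_literal is a hypothetical helper name. It counts non-overlapping,
# literal occurrences of $2 in $1, so regex metacharacters and backslashes
# in the needle need no escaping. Whitespace must match exactly.
count_literal() {
    HAY="$1" NEEDLE="$2" awk 'BEGIN {
        hay = ENVIRON["HAY"]
        needle = ENVIRON["NEEDLE"]
        n = 0
        if (needle != "") {
            # Find the next literal occurrence, count it, then continue
            # searching in the text that follows it.
            while ((pos = index(hay, needle)) > 0) {
                n++
                hay = substr(hay, pos + length(needle))
            }
        }
        print n
    }'
}

count_literal 'a\.b a.b a\.b' 'a\.b'   # prints 2
```

The trade-off is that this cannot absorb whitespace differences between the two strings; for that, the regex-based versions above are still needed.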

Answer 3

You can try this with GNU grep:

grep -zo -P ".*Bluetooth\n\s*Soft blocked: no\n\s*Hard blocked: no" <your_file> | grep -c "Bluetooth"

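The first grep matches across multiple lines and prints only the matched text. Counting the occurrences of Bluetooth in that output then gives you the number of matched "substrings".

Output of the first grep: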

1: asus-bluetooth: Bluetooth
         Soft blocked: no
         Hard blocked: no
113: hci0: Bluetooth
         Soft blocked: no
         Hard blocked: no

Output of the whole command:

2

Answer 4

Using Python:

#! /usr/bin/env python

import sys
import re

with open(sys.argv[1], 'r') as i:
    print(len(re.findall(sys.argv[2], i.read(), re.MULTILINE)))

Invocation:

$ ./search.py file.txt 'Bluetooth
 +Soft blocked: no
 +Hard blocked: no'

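The " +" at the start of each continuation line allows one or more spaces.

Edit

If the content is already in Bash variables, it is even simpler: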

#! /usr/bin/env python

import sys
import re

print(len(re.findall(sys.argv[2], sys.argv[1], re.MULTILINE)))

Invocation:

$ ./search.py "$STRING" "$SUB_STRING"

Answer 5

This might work for you (GNU sed & wc):

sed -nr 'N;/^(\s*)Soft( blocked: no\s*)\n\1Hard\2$/P;D' file | wc -l

It outputs one line for each occurrence of the multi-line match, and wc counts those lines.

Answer 6

Another awk:

awk '
  NR==FNR{
    b[i++]=$0          # get each line of string in array b
    next}
  $0 ~ b[0]{            # if current record match first line of string
    for(j=1;j<i;j++){
      getline
      if($0!~b[j])  # next record do not match break
        j+=i}
     if(j==i)         # all record match string
       k++}
  END{
    print k}
' stringfile infile

Edit:

For the OP's XY problem, a simple script:

cat scriptbash.sh

list="${1//$'\n'/@}"
var="${2//$'\n'/@}"
result="${list//$var}"
echo $(((${#list} - ${#result}) / ${#var}))

You call it like this:

./scriptbash.sh "$STRING" "$SUB_STRING"
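The arithmetic in the script is: remove every occurrence of the substring, then divide the number of characters removed by the substring's length. A sketch of the same idea as a reusable function follows; the name count_substr and the sample data are assumptions. Unlike the script above, it quotes the needle inside the expansion so that Bash matches it literally rather than as a glob pattern, and it skips the newline-to-@ translation, since Bash's pattern substitution accepts embedded newlines.

```shell
#!/usr/bin/env bash
# count_substr is a hypothetical helper name. It counts non-overlapping,
# literal occurrences of $2 in $1 using only Bash parameter expansion.
count_substr() {
    local hay=$1 needle=$2
    # An empty needle would cause a division by zero below.
    [ -n "$needle" ] || { echo 0; return; }
    # Delete every occurrence; the inner quotes make the match literal.
    local stripped=${hay//"$needle"/}
    # Characters removed divided by the needle length gives the count.
    echo $(( (${#hay} - ${#stripped}) / ${#needle} ))
}

# Sample data (assumed), with identical indentation in both strings.
hay=$'x: Bluetooth\n  Soft\ny: LAN\n  Soft\nz: Bluetooth\n  Soft'
needle=$'Bluetooth\n  Soft'

count_substr "$hay" "$needle"   # prints 2
```

As with the original script, the whitespace in both strings must match exactly.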
