Question

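As a learning exercise, I want to write a shell script that uses a regular expression to extract a pattern from a file and fills an array with every occurrence of that pattern.

What is the best way to achieve this?

I am trying to do this with sed. One problem I face is that the patterns can contain newlines, and those newlines must be preserved. For example:

File content: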

"My name 
is XXX"
"My name is YYY"
"Today
is
the "

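When I extract all patterns between double quotes (including the double quotes themselves), the output for the first occurrence must be: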

"My name
is XXX"

Answer 1

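Fill an array with all occurrences of the pattern

First convert the file so that the matches are separated by a meaningful delimiter, e.g. the null byte, for example with GNU sed and its -z switch: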

sed -z 's/"\([^"]*\)"[^"]*/\1\x00/g'

I added [^"]* at the end so that characters that are not between double quotes are removed.

After that, parsing becomes much simpler.
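To make the converted stream visible, here is a minimal sketch. It assumes GNU sed; the sample string and the use of | as a stand-in for the null byte are only for illustration and are not part of the original answer.

```bash
#!/usr/bin/env bash
# Hypothetical sample: two quoted strings, the first spanning two lines,
# with unquoted text ("junk") between them.
sample=$'"a\nb" junk "c"\n'

# -z makes GNU sed read the whole input as one NUL-terminated record,
# so [^"]* is free to match across newlines.
# tr turns each NUL into | purely so the record boundaries can be seen.
out=$(printf '%s' "$sample" | sed -z 's/"\([^"]*\)"[^"]*/\1\x00/g' | tr '\0' '|')

printf '%s\n' "$out"
```

Each extracted string ends with the delimiter, and the text outside the quotes no longer appears in the output.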

You can get the first element with:

head -z -n1

Or sort the occurrences and count them:

sort -z | uniq -z -c

Or load them into an array with bash's mapfile:

mapfile -d '' -t arr < <(<input sed -z 's/"\([^"]*\)"[^"]*/\1\x00/g')

Alternatively, you can use e.g. $'\01' as the separator. As long as that byte does not occur in the data, the result is easy to parse in bash.
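A sketch of the separator variant, assuming GNU sed and bash 4.4 or newer (for mapfile -d). The sample text is hypothetical, and the byte 0x01 is assumed never to occur in it.

```bash
#!/usr/bin/env bash
# Hypothetical sample input; the byte 0x01 is assumed never to occur in it.
sample=$'"My name\nis XXX"\n"My name is YYY"\n'

# Same sed expression as above, but emitting 0x01 instead of NUL,
# then split on that byte (mapfile -d needs bash 4.4 or newer).
mapfile -d $'\x01' -t arr < <(
    printf '%s' "$sample" | sed -z 's/"\([^"]*\)"[^"]*/\1\x01/g'
)

declare -p arr
```

Unlike the null byte, 0x01 can also be kept inside an ordinary shell variable, which is the main reason to consider it.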

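Handling such streams in bash is a bit awkward. A shell variable cannot hold a value with embedded null bytes, and you should expect occasional warnings from command substitution. When I deal with data made of arbitrary bytes, I usually convert it to plain ASCII with xxd -p and convert it back with xxd -r -p. That makes things much easier.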
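The following sketch illustrates both points. It assumes that xxd is installed (it usually ships with vim); the three-byte sample is made up for the demonstration.

```bash
#!/usr/bin/env bash
# Command substitution drops NUL bytes (bash 4.4+ also prints a warning),
# so the variable ends up holding only the two letters.
raw=$(printf 'a\0b')
echo "length of raw: ${#raw}"

# Hex-encode the bytes instead; the result is plain ASCII and safe to store.
hex=$(printf 'a\0b' | xxd -p)
echo "hex: $hex"

# Decode again only at the point where the raw bytes are needed;
# od -c is used here just to display them.
printf '%s' "$hex" | xxd -r -p | od -An -c
```

The hex form doubles the size of the data, so this is a convenience for small inputs rather than a general-purpose approach.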

The following script:

cat <<'EOF' >input
"My name
is XXX"
"My name is YYY"
"Today
is
the "
EOF

sed -z 's/"\([^"]*\)"[^"]*/\1\x00/g' input > input_parsed

echo "##First element is:"
printf '"'
<input_parsed head -z -n1 
printf '"\n'

echo "##Elements count are:"
<input_parsed sort -z | uniq -z -c

echo
echo "##The array is:"
mapfile -d '' -t arr <input_parsed
declare -p arr

will output (the formatting is slightly off because the output of uniq is not newline-delimited):

##First element is:
"My name
is XXX"
##Elements count are:
      1 My name
is XXX      1 My name is YYY      1 Today
is
the 
##The array is:
declare -a arr=([0]=$'My name\nis XXX' [1]="My name is YYY" [2]=$'Today\nis\nthe ')

Tested on repl.it.

Answer 2

This may be what you are looking for, depending on the answers to the questions I posted in a comment:

$ readarray -d '' -t arr < <(grep -zo '"[^"]*"' file)

$ printf '%s\n' "${arr[0]}"
"My name
is XXX"

$ declare -p arr
declare -a arr=([0]=$'"My name \nis XXX"' [1]="\"My name is YYY\"" [2]=$'"Today\nis\nthe "')

It relies on GNU grep for -z.
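If only the text inside the quotes is wanted, the quotes can be stripped afterwards with parameter expansion. This is a sketch that goes slightly beyond the answer above; it assumes GNU grep and bash 4.4 or newer, and the sample file is created only for the demonstration.

```bash
#!/usr/bin/env bash
# Hypothetical sample file, created only for this demonstration.
file=$(mktemp)
printf '"My name \nis XXX"\n"My name is YYY"\n' > "$file"

# grep -z treats the input as NUL-separated, -o prints each match
# followed by a NUL, and readarray -d '' splits on those NULs.
readarray -d '' -t arr < <(grep -zo '"[^"]*"' "$file")

# Strip the leading and the trailing quote from each element.
for i in "${!arr[@]}"; do
    inner=${arr[i]#\"}
    inner=${inner%\"}
    arr[i]=$inner
done

declare -p arr
rm -f "$file"
```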

Answer 3

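Sed can extract the desired pattern whether or not it contains newlines. However, if you want to store multiple results in a bash array, bash's own regex matching may be easier.
Would you please try the following: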

lines=$(< "file")                   # slurp all lines
re='"[^"]+"'                        # regex to match substring between double quotes
while [[ $lines =~ ($re)(.*) ]]; do
    array+=("${BASH_REMATCH[1]}")   # push the matched pattern to the array
    lines=${BASH_REMATCH[2]}        # update $lines with the remaining part
done

# report the result
for (( i=0; i<${#array[@]}; i++ )); do
    echo "$i: ${array[$i]}"
done

Output:

0: "My name
is XXX"
1: "My name is YYY"
2: "Today
is
the "
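To see how a single iteration of that loop splits the text, here is a small sketch on a hypothetical one-line string. The first capture group holds the leftmost quoted string, and the second holds everything after it, which the loop feeds back in as the new input.

```bash
#!/usr/bin/env bash
s='x "a" y "b"'            # hypothetical one-line input
re='"[^"]+"'               # same regex as in the loop above

if [[ $s =~ ($re)(.*) ]]; then
    first=${BASH_REMATCH[1]}   # leftmost quoted string
    rest=${BASH_REMATCH[2]}    # everything after that string
    printf 'first=<%s> rest=<%s>\n' "$first" "$rest"
fi
```

The loop ends when the remaining text no longer contains a quoted string, because the match then fails.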

How do I extract patterns from a file and fill a bash array with them?
