Question

I have an AWK script like this that I will be running on a file:

cat input.txt | awk 'gsub(/[^ ]*(fish|shark|whale)[^ ]*/,"(&)")' >> output.txt

This puts parentheses around every word containing "fish", "shark" or "whale", for example:

The whale asked the shark to swim elsewhere.
The fish were unhappy.

After running it through the script, the file becomes:

The (whale) asked the (shark) to swim elsewhere.
The (fish) were unhappy.

The file is marked up with HTML tags, and I only need the replacement to happen between <b> and </b> tags.

The whale asked <b>the shark to swim</b> elsewhere.
<b>The fish were</b> unhappy.

This should become:

The whale asked <b>the (shark) to swim</b> elsewhere.
<b>The (fish) were</b> unhappy.

Matching bold tags are never split across lines. The opening <b> always appears on the same line as its closing </b>.

How can I restrict awk so that it only searches and modifies the text between <b> and </b> tags?

Answer 1

As long as the HTML is not malformed, and the <b>...</b> spans do not contain any other HTML tags, this is relatively easy in Perl:

$ cat data
The whale asked <b>the shark to swim</b> elsewhere.
<b>The fish were</b> unhappy.
The <b> dogfish and the sharkfin soup</b> were unscathed.
$ perl -pe 's/(<b>[^<]*)\b(fish|shark|whale)\b([^<]*<\/b>)/$1($2)$3/g' data
The whale asked <b>the (shark) to swim</b> elsewhere.
<b>The (fish) were</b> unhappy.
The <b> dogfish and the sharkfin soup</b> were unscathed.
$

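I tried adapting this to awk (and gawk), but without success; the matching part works, but the replacement expression does not. As far as I can tell from the manual, sub and gsub, unlike Perl's s///, cannot refer to the individual parenthesized subexpressions in the replacement.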
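
One possible way around the missing backreferences, sketched here under the same assumptions (well-formed markup, no nested tags inside a <b>...</b> span), is to let match() locate each span and run gsub on the extracted text only. The function name paren_in_bold is made up for illustration. Note that this version matches substrings rather than whole words, so "dogfish" would also be changed, unlike the \b version in Perl.

```shell
# Hypothetical helper: parenthesize fish/shark/whale only inside <b>...</b>.
paren_in_bold() {
    awk '
    {
        out = ""
        rest = $0
        # match() sets RSTART and RLENGTH for the leftmost <b>...</b> span.
        while (match(rest, /<b>[^<]*<\/b>/)) {
            span = substr(rest, RSTART, RLENGTH)
            gsub(/fish|shark|whale/, "(&)", span)   # edit the span only
            out = out substr(rest, 1, RSTART - 1) span
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }' "$@"
}

printf '%s\n' 'The whale asked <b>the shark to swim</b> elsewhere.' \
              '<b>The fish were</b> unhappy.' | paren_in_bold
```

It only uses match, substr and gsub with a target variable, so it should not need gawk specifically.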

Answer 2

Here is one technique using awk:

awk '/<b>/{f=1}f{gsub(/fish|shark|whale/,"(&)")}/<\/b>/{f=0}1' RS=' ' ORS=' ' file
The whale asked <b>the (shark) to swim</b> elsewhere.
<b>The (fish) were</b> unhappy.
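
For readability, the flag technique can be spelled out as a commented script. This is only a sketch: the function name flag_in_bold and the variable name inside are made up for illustration, and the </b> check is placed after the substitution so that a word attached to the closing tag (such as shark</b>) is still processed.

```shell
# Hypothetical helper: same flag idea, one space-separated word per record.
flag_in_bold() {
    awk '
    BEGIN { RS = ORS = " " }        # each record is one space-separated word
    /<b>/   { inside = 1 }          # a record containing <b> opens the span
    inside  { gsub(/fish|shark|whale/, "(&)") }
    /<\/b>/ { inside = 0 }          # close only after the substitution ran
    { print }
    ' "$@"
}

printf '%s\n' 'The whale asked <b>the shark to swim</b> elsewhere.' \
              '<b>The fish were</b> unhappy.' | flag_in_bold
```

Two caveats: the flag works per record, so anything in the same space-separated chunk as a tag is treated as inside the span, and because ORS is a space the output ends with one extra trailing space.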
