How can I restrict awk to search only items contained within a particular HTML tag?

Time: 2013-04-21 00:45:12

Tags: awk replace

I have an awk script like this, which I run on a file:

cat input.txt | awk 'gsub(/[^ ]*(fish|shark|whale)[^ ]*/,"(&)")' >> output.txt

This puts parentheses around every word that contains "fish", "shark", or "whale". For example, given this input:

The whale asked the shark to swim elsewhere.
The fish were unhappy.

After running the script, the file becomes:

The (whale) asked the (shark) to swim elsewhere.
The (fish) were unhappy.

The file is marked up with HTML tags, and I need the substitution to take place only between <b></b> tags:

The whale asked <b>the shark to swim</b> elsewhere.
<b>The fish were</b> unhappy.

This becomes:

The whale asked <b> the (shark) to swim </b> elsewhere.
<b> The (fish) were </b> unhappy.
  • Matching bold tags are never split across lines: the opening <b> tag always appears on the same line as its closing </b> tag.

How can I restrict awk so that it searches and modifies only the text between <b></b> tags?

2 answers:

Answer 0 (score: 1)

As long as the HTML markup is not pathological, and the <b> ... </b> spans do not contain any other HTML tags, this is relatively easy in Perl:

$ cat data
The whale asked <b>the shark to swim</b> elsewhere.
<b>The fish were</b> unhappy.
The <b> dogfish and the sharkfin soup</b> were unscathed.
$ perl -pe 's/(<b>[^<]*)\b(fish|shark|whale)\b([^<]*<\/b>)/$1($2)$3/g' data
The whale asked <b>the (shark) to swim</b> elsewhere.
<b>The (fish) were</b> unhappy.
The <b> dogfish and the sharkfin soup</b> were unscathed.
$ 

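I tried adapting this to awk (and gawk), but without success; the matching part works, but the replacement expression does not. From my reading of the manual, the replacement text of awk's sub and gsub cannot refer to separately matched parenthesized subexpressions, unlike Perl. (gawk's gensub function does accept \1 through \9 in the replacement, but it is a GNU extension.)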
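One way to sidestep the missing backreferences in portable awk is to isolate each <b>...</b> span with match() and run gsub on that span alone. The following is a minimal sketch, not a tested drop-in: mark_bold is a hypothetical helper name, and it assumes (as above) that a span never contains another tag and never crosses a line. It follows the question's regex, so it wraps any word containing a target string, rather than the whole-word \b matching of the Perl version.

```shell
# mark_bold is a hypothetical helper name; it filters lines from stdin.
mark_bold() {
    awk '
    {
        out = ""      # text already processed
        rest = $0     # text still to scan
        # Peel off one <b>...</b> span at a time; text before it is copied untouched.
        while (match(rest, /<b>[^<]*<\/b>/)) {
            span = substr(rest, RSTART, RLENGTH)
            # Wrap each word containing a target string; excluding < and >
            # keeps the tags themselves outside the parentheses.
            gsub(/[^ <>]*(fish|shark|whale)[^ <>]*/, "(&)", span)
            out = out substr(rest, 1, RSTART - 1) span
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }'
}

mark_bold <<'EOF'
The whale asked <b>the shark to swim</b> elsewhere.
<b>The fish were</b> unhappy.
EOF
# Prints:
# The whale asked <b>the (shark) to swim</b> elsewhere.
# <b>The (fish) were</b> unhappy.
```

Because the substitution only ever sees the extracted span, "whale" outside the bold tags is left alone, and several spans on one line are handled by the loop.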

Answer 1 (score: 1)

Here is one technique using awk:

awk '/<b>/{f=1}/<\/b>/{f=0}f{gsub(/fish|shark|whale/,"(&)")}1' RS=' ' ORS=' ' file
The whale asked <b>the (shark) to swim</b> elsewhere.
<b>The (fish) were</b> unhappy.
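The one-liner above clears the flag before substituting, so a target word attached to the closing tag (for example "shark</b>") would be skipped, and because RS=' ' does not split on newlines, the last word of one line and the first word of the next form a single record. A possible variation of the same flag idea, sketched here under the question's guarantee that tags never span lines, walks the fields of each line instead. The names mark_bold_words and inside are hypothetical, and like the original it matches substrings (so "dogfish" would become "dog(fish)").

```shell
# mark_bold_words is a hypothetical helper name; it filters lines from stdin.
mark_bold_words() {
    awk '
    {
        inside = 0                       # tags never span lines, so reset per line
        for (i = 1; i <= NF; i++) {
            if ($i ~ /<b>/) inside = 1   # opening tag seen in this field
            if (inside) gsub(/fish|shark|whale/, "(&)", $i)
            if ($i ~ /<\/b>/) inside = 0 # clear only after substituting
        }
        print
    }'
}

mark_bold_words <<'EOF'
The whale asked <b>the shark</b> to swim elsewhere.
<b>The fish were</b> unhappy.
EOF
# Prints:
# The whale asked <b>the (shark)</b> to swim elsewhere.
# <b>The (fish) were</b> unhappy.
```

One trade-off to be aware of: assigning to a field makes awk rebuild the line with the output field separator, so runs of whitespace on a modified line collapse to single spaces.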