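Imagine several HTML files merged into one, with all the leftover formatting, tags, and so on (never mind why). Which tool should be used to search from the first line of each subsequently merged HTML file, i.e. <!doctype html>, up to the start of the <h1> title? Each matched range should be replaced with a horizontal rule.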
---END OF PREV MERGED FILE---
---BEGIN SEARCH/REPLACE HERE---
<!doctype html>
<!--[if !IE]>
<html class="no-js non-ie" lang="en-US" prefix="og: http://ogp.me/ns#"> <![endif]-->
<!--[if IE 7 ]>
<html class="no-js ie7" lang="en-US" prefix="og: http://ogp.me/ns#"> <![endif]-->
<!--[if IE 8 ]>
<html class="no-js ie8" lang="en-US" prefix="og: http://ogp.me/ns#"> <![endif]-->
<!--[if IE 9 ]>
---HEAD,META,ETC---
---END SEARCH/REPLACE HERE---
<h1>TITLE OF NEXT MERGED FILE</h1>
I am not sure whether sed and awk are the wrong tools for this, but a solution using them or something similar is preferred.
Input
<li><strong>email_from = root@localhost</strong>, <strong>email_to = root</strong>, <strong>email_host = localhost</strong> defines respectively when the message is a mail the originator’s email address, the recipient’s
email address and the host to which the mail is sent.<strong><br />
30658 </strong></li>
30659 </ul>
30660 <p>Source: <a title="http://linuxaria.com/howto/enabling-automatic-updates-in-centos-7-and-rhel-7" href="http://linuxaria.com/howto/enabling-automatic-updates-in-centos-7-and-rhel-7">Linuxaria’s website</a>.</p>
30661 </div><!-- end of .post-entry -->
30662
30663 <div class="post-edit"></div>
30664 </div><!-- end of #post-4116 -->
30665
30666
30667 <!doctype html>
30668 <!--[if !IE]>
30669 <html class="no-js non-ie" lang="en-US" prefix="og: http://ogp.me/ns#"> <![endif]-->
30670 <!--[if IE 7 ]>
30671 <html class="no-js ie7" lang="en-US" prefix="og: http://ogp.me/ns#"> <![endif]-->
30672 <!--[if IE 8 ]>
30673 <html class="no-js ie8" lang="en-US" prefix="og: http://ogp.me/ns#"> <![endif]-->
30674 <!--[if IE 9 ]>
30675 <html class="no-js ie9" lang="en-US" prefix="og: http://ogp.me/ns#"> <![endif]-->
30676 <!--[if gt IE 9]><!-->
30677 <html class="no-js" lang="en-US" prefix="og: http://ogp.me/ns#"> <!--<![endif]-->
30678 <head>
30679 <meta charset="UTF-8"/>
30680 <meta name="viewport" content="width=device-width, initial-scale=1.0">
30681 <title>something something</title>
30682 <link rel="profile" href="http://gmpg.org/xfn/11"/>
30683 <link rel="pingback" href="www.example.com"/>
30684
30685 <h1 class="entry-title post-title">Something Something</h1>
Expected output
<li><strong>email_from = root@localhost</strong>, <strong>email_to = root</strong>, <strong>email_host = localhost</strong> defines respectively when the message is a mail the originator’s email address, the recipient’s
email address and the host to which the mail is sent.<strong><br />
30658 </strong></li>
30659 </ul>
30660 <p>Source: <a title="http://linuxaria.com/howto/enabling-automatic-updates-in-centos-7-and-rhel-7" href="http://linuxaria.com/howto/enabling-automatic-updates-in-centos-7-and-rhel-7">Linuxaria’s website</a>.</p>
30661 </div><!-- end of .post-entry -->
30662
30663 <div class="post-edit"></div>
30664 </div><!-- end of #post-4116 -->
<hr />
30685 <h1 class="entry-title post-title">Something Something</h1>
Answer 0 (score: 1)
This seems to do what you want:
awk '/<!doctype html>/{f=1;print " <hr />";} /<h1 class=/{f=0;} !f' input >output
/<!doctype html>/{f=1;print " <hr />";}
When we reach a line containing <!doctype html>, the flag f is set to 1, which signals that printing should stop. We then print the horizontal rule.
/<h1 class=/{f=0;}
When we reach a line containing <h1 class=, the flag f is set to 0, which signals that printing can resume.
!f
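If f is 0, this causes the current line to be printed.
In more detail, !f is a condition. When a condition is true, awk performs the associated action. Since no action is specified here, awk performs its default action, which is to print the line. The ! is awk's negation operator, so when f is false (0), !f is true and the line is printed.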
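As a minimal sketch of the flag logic described above, the following runs the same command on a tiny made-up sample supplied through a here-document. The sample lines and the function name strip_headers are assumptions for illustration only, and !f is written out as !f { print } to make the default action explicit.

```shell
#!/bin/sh
# strip_headers: hypothetical helper wrapping the awk command from this answer.
# Reads merged HTML on stdin and writes the filtered text to stdout.
strip_headers() {
    awk '
        /<!doctype html>/ { f = 1; print " <hr />" }  # start skipping, emit the rule
        /<h1 class=/      { f = 0 }                   # stop skipping at the title
        !f                { print }                   # same as a bare !f
    '
}

# Made-up sample: the tail of one post followed by the header of the next.
strip_headers <<'EOF'
</div><!-- end of #post-4116 -->
<!doctype html>
<head>
<title>something something</title>
<h1 class="entry-title post-title">Something Something</h1>
EOF
```

Run on this sample, it should print the closing div line, then a line containing <hr /> (with one leading space), then the h1 line. The doctype, head, and title lines are skipped.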
Suppose we want to remove every doctype header except the first. In that case:
awk '/<!doctype html>/{count++; if (count>1){f=1; print " <hr />";}} /<h1 class=/{f=0;} !f' input
This works by adding another variable, count, which tracks how many doctype tags we have seen. The flag f is set to 1 only once we have seen more than one doctype tag.
To demonstrate the above, we use this input file:
$ cat input2
miscellaneous stuff
30667 <!doctype html>
30668 something
30669 <h1 class="entry-title post-title">Something Something</h1>
More stuff
30667 <!doctype html>
30668 something 2
30669 <h1 class="entry-title post-title">Something Something</h1>
Still More stuff
30667 <!doctype html>
30668 something 3
30669 <h1 class="entry-title post-title">Something Something</h1>
Stuff at end
The command produces this output:
$ awk '/<!doctype html>/{count++; if (count>1){f=1; print " <hr />";}} /<h1 class=/{f=0;} !f' input2
miscellaneous stuff
30667 <!doctype html>
30668 something
30669 <h1 class="entry-title post-title">Something Something</h1>
More stuff
<hr />
30669 <h1 class="entry-title post-title">Something Something</h1>
Still More stuff
<hr />
30669 <h1 class="entry-title post-title">Something Something</h1>
Stuff at end
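If the merged files are less uniform than the sample, the patterns may need loosening. The sketch below is one possible variation, not part of the original command: it assumes the doctype might be written in any letter case and that the h1 tag might carry no class attribute. The function name replace_headers is hypothetical.

```shell
#!/bin/sh
# replace_headers: hypothetical variant of the count-based command.
# Assumptions: doctype and h1 may appear in any letter case, and the
# h1 tag may or may not have attributes.
replace_headers() {
    awk '
        # Keep the first doctype header; replace each later one with a rule.
        tolower($0) ~ /<!doctype html>/ { if (++count > 1) { f = 1; print "<hr />" } }
        # Any h1 opening tag ends the skipped range.
        tolower($0) ~ /<h1[ >]/ { f = 0 }
        # Print the line unless we are inside a skipped range.
        !f
    '
}
```

It would be used the same way as the commands above, for example replace_headers <input2. Matching on tolower($0) leaves the printed lines in their original case, since only the comparison is lowercased.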