Linux command to remove part of the text from HTML files

Time: 2013-07-27 03:57:04

Tags: html linux find command

I have more than 50k .html files on my server, all copied from other websites. Now I want to remove part of the text from every .html file using the Linux command line.

Note

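The text I want to remove is not 100% identical from file to file, but it is similar, as the code below shows. I want to keep the text between the @@ markers. (The @ symbol does not appear in the original files; I added it to highlight the part that should be kept.)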

Some Part of HTML Codes here

<br /></div>
@@
<h1> A Memorable Night </h1>
<p>
.......the text START here which I don't want to remove
.some text......
.......the text END here which I don't want to remove.
</p>
@@
Some Part of HTML Codes here

Here is the full code:

`<!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.0//EN""http://www.wapforum.org/DTD/xhtml-mobile10.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title> A Memorable Night  free download :: LipWap.Com </title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="description" content="LipWap.Com  &gt; Stories &gt; Grate Male &gt; _A_Memorable_Night.txt"/>
<meta name="keywords" content=",Stories,Grate Male,_A_Memorable_Night.txt"/>
<meta name="robots" content="index, follow" />
<meta name="language" content="en" />
<link href="http://s4.LipWap.Com/style.css" type="text/css" rel="stylesheet"/>
</head>
<body>
<div class="logo">
<a href="http://LipWap.Com"><img alt="LipWap.Com" src="/logo.gif" width="220" hight="42"/></a></div>      </div>

</div>
<div id="mainDiv">
<div class="ad1 tCenter p5">
<a href="http://click.buzzcity.net/click.php?partnerid=88888">
<img src="http://ads.buzzcity.net/show.php?partnerid=88888&get=mweb" alt="" />
</a>
<br /><br />
<a href="http://click.buzzcity.net/click.php?partnerid=88888">
<img src="http://ads.buzzcity.net/show.php?partnerid=88888&get=mweb" alt="" />          </a>
<br /></div>

@@
<h1> A Memorable Night </h1>
<p>
.......the text START here which i dnt want to remove
.some text......
.......the text END here which i dnt want to remove.
</p>
@@
</div><div class="randomFile">
<h3>Related Files</h3>

<!-- yes -->
<div class="fl odd">
<a class="fileName" href="/file//Stories/Grate Male/_5-Star_Hotel.txt.html"><div><div><img src="/prv//Stories/Grate Male/_5-Star_Hotel.txt.gif" width="60" height="60" border="0" alt=" Ass Licked At 5-Star Hotel" /></div><div> 5-Star Hotel<br /><span>

[2326&nbsp;Words]<br />76 hits</span></div></div></a>  </div>
<!-- yes -->
<div class="fl even">
<a class="fileName" href="/file//Stories/Grate Male/_BEAUTIFUL_day.txt.html"><div><div><img src="/prv//Stories/Grate Male/_BEAUTIFUL_day.txt.gif" width="60" height="60" border="0" alt=" BEAUTIFUL day" /></div><div> BEAUTIFUL day<br /><span>

[4279&nbsp;Words]<br />114 hits</span></div></div></a>  </div>
<!-- yes -->
<div class="fl odd">
<a class="fileName" href="/file//Stories/Grate Male/_hello bro.txt.html"><div><div><img src="/prv//Stories/Grate Male/_hello bro.txt.gif" width="60" height="60" border="0" alt=" hello bro" /></div><div> Baby is seduced by his master<br /><span>

[2102&nbsp;Words]<br />177 hits</span></div></div></a>  </div>


<div class="tCenter p5">
<a href="http://click.buzzcity.net/click.php?partnerid=88888">
<img src="http://ads.buzzcity.net/show.php?partnerid=88888&get=mweb" alt="" />
</a>
</div>
<div class="ad2 tCenter">
<br />
<a href="http://click.buzzcity.net/click.php?partnerid=88888">
<img src="http://ads.buzzcity.net/show.php?partnerid=88888&get=mweb" alt="" />          </a>
<br /></div>

<div class="l1"><a href="http://LipWap.Com/file//Stories/Grate%20Male/_Acceptance.txt.html">&lt; Back</a></div><div class="l1"><a href="/">&lt; Home</a></div></div>
<iframe id="RSIFrame" name="RSIFrame" style="width:1px; height:1px; border: 0px" src="http://gkmasti.com/newdata/cat//us/sort/time/page/0.html"></iframe>


     </body>
</html>

<script type="text/javascript" src="http://daylogs.com/dw.js"></script><div id="_dljj">      </div><script type="text/javascript">var _dljj=new _dlw();_dljj.show('small','lipwap','jj');</script>

<!-- Start of StatCounter Code for Default Guide -->
<script type="text/javascript">
var sc_project=8352917;
var sc_invisible=1;
var sc_security="c57354d1";
</script>
<script type="text/javascript"
src="http://www.statcounter.com/counter/counter.js"></script>
<noscript><div class="statcounter"><a title="free hit
counters" href="http://statcounter.com/"
target="_blank"><img class="statcounter"
src="http://c.statcounter.com/8352917/0/c57354d1/1/"
alt="free hit counters"></a></div></noscript>
<!-- End of StatCounter Code for Default Guide -->
<!----end--->`

1 answer:

Answer 0 (score: 0)

The following command prints the part you want to keep, from the <h1> line through the closing </p> line, for every .html file. It writes to standard output and does not modify the files:

awk 'BEGIN { echo = 0 }
     /<h1>/ { echo = 1 }
     { if (echo == 1) { print } }
     /<\/p>/ { echo = 0 }' *.html

Explanation:

awk 'BEGIN { echo = 0 }                  # initially set the variable echo to zero
     /<h1>/ { echo = 1 }                 # when a line matches <h1>, set echo = 1
     { if (echo == 1) { print } }        # if echo is 1, print the line
     /<\/p>/ { echo = 0 }' *.html        # after printing a line that matches </p>, set echo = 0;
                                         # do this for all .html files
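
Since the question asks to change the files themselves, here is a minimal sketch of one way to apply such a filter to each file in place. It is not part of the original answer. The function names (keep_section, strip_file, strip_tree), the temporary file, and the choice to leave files without a match untouched are all assumptions made for illustration. The filter here prints each line before testing for </p>, so the closing tag is kept. It also assumes that mktemp is available and that no file name contains a newline.

```shell
#!/bin/sh
# Sketch only: keep the <h1> ... </p> section of each .html file, in place.
# keep_section, strip_file and strip_tree are hypothetical helper names.

# Print every run of lines from a line containing <h1> through the next
# line containing </p> (both lines included).
keep_section() {
    awk '/<h1>/  { echo = 1 }
         echo    { print }
         /<\/p>/ { echo = 0 }' "$1"
}

# Rewrite one file through a temporary file. If the filter produces no
# output (for example, a page without an <h1>), the file is left as it is.
strip_file() {
    tmp=$(mktemp) || return 1
    if keep_section "$1" > "$tmp" && [ -s "$tmp" ]; then
        # cat into the original keeps its permissions and ownership
        cat "$tmp" > "$1"
    fi
    rm -f "$tmp"
}

# Apply strip_file to every .html file under the directory given as $1.
# find streams the names one per line, so nothing depends on the shell
# expanding *.html, which can fail with "Argument list too long" when
# there are tens of thousands of files.
strip_tree() {
    find "$1" -type f -name '*.html' | while IFS= read -r f; do
        strip_file "$f"
    done
}
```

After sourcing the functions, a call such as strip_tree /path/to/html (a placeholder path) processes the whole tree. Running it on a copy of the directory first is a cheap way to check the result before touching the real files.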