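This code has caused me a lot of pain because it does not work. I am trying to strip all HTML tags and script tags, along with the JavaScript inside them, from an HTML file so that only the plain text content is left.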
sed -e 's/<[^>]\+>/ /g' -e '/<script/,/<\/script>/d'
This code removes the HTML tags and the script tags, but it does not remove the script content.
sed -e 's/<[^>]\+>/ /g' -e 's/<script>try.*<\/script>//'
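This was meant to handle more script tags, but it still does not remove the content. The following code, on the other hand, does remove the script tags and their content, but I cannot get it to work together with the HTML tag removal.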
awk '/<script>/{p=1} /<\/script>/{p=0;next}!p'
So when I combine the two into something like the next command, it removes the script and its content, but the HTML tags are still there.
sed 's/<[^>]\+>/ /g' | awk '/<script>/{p=1} /<\/script>/{p=0;next}!p'
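One likely reason the combined attempts fail is ordering: once the tag-stripping substitution has run, the `<script>` and `</script>` markers are already gone, so a later range match (in sed or in awk) has nothing left to find. A minimal sketch of the reversed order is below. It assumes, as in the sample data, that the opening and closing script tags sit on separate lines and that no tag spans several lines. The function name `strip_html` and the final blank-line cleanup are additions made for this illustration, not part of the original commands.

```shell
#!/bin/sh
# strip_html: hypothetical helper name chosen for this sketch.
# Reads HTML on stdin and writes the remaining text to stdout.
strip_html() {
    # 1. Delete every line from one containing <script to one containing </script>,
    #    while the tags are still present to be matched.
    # 2. Replace each remaining tag with a space ([^>]* avoids the GNU-only \+).
    # 3. Drop lines that are now empty or whitespace only (an added cleanup step).
    sed -e '/<script/,/<\/script>/d' \
        -e 's/<[^>]*>/ /g' \
        -e '/^[[:space:]]*$/d'
}

# Small demonstration on a few lines taken from the sample data.
printf '%s\n' \
    '<title>BTKRSH //</title>' \
    '<script>' \
    '//test test' \
    '</script>' \
    '<p>BACKGROUND ARTIST</p>' | strip_html
```

For a real file this would be called as `strip_html < page.html`, where `page.html` is a placeholder name. If a script block opens and closes on a single line, the sed range keeps deleting until the next `</script>` or the end of input, so a real HTML parser is the safer choice for arbitrary pages.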
Sample data:
<html>
<head>
<title>BTKRSH //</title>
<link rel="stylesheet" type="text/css" href="style.css">
</head>
<script>
//test test
</script>
<body>
<div class="left">
<table style="width: 100%; height: 100%;">
<div id="closebtn">
<a class="hidden-x"> <img src="x-gray.png"></img> </a>
</div>
<tr><td style="vertical-align: middle; text-align: center;">
<div class="menu">
<a>PODCASTS</a>
<div class="hidden-menu podcasts">
<iframe width="400" height="400" src="https://www.mixcloud.com/widget/iframe/?feed=http%3A%2F%2Fwww.mixcloud.com%2FBTKRSH%2F&embed_uuid=f78341ae-da15-480f-9604-d6812bb9a83d&replace=0&stylecolor=190303&embed_type=widget_standard" frameborder="0"></iframe><div style="clear: both; height: 3px; width: 392px;"></div><p style="display: block; font-size: 11px; font-family: 'Open Sans', Helvetica, Arial, sans-serif; margin: 0px; padding: 3px 4px; color: rgb(153, 153, 153); width: 392px;"><a href="http://www.mixcloud.com/BTKRSH/?utm_source=widget&amp;utm_medium=web&amp;utm_campaign=base_links&amp;utm_term=resource_link" target="_blank" style="color: rgb(25, 3, 3); font-weight: bold;">
</div>
<a>RELEASES</a>
<div class="hidden-menu releases">
</div>
<a>ARTISTS</a>
<div class="hidden-menu artists">
</div>
<a>LINKS</a>
<div class="hidden-menu links">
</div>
<a>ABOUT</a>
<div class="hidden-menu about">
</div>
<a>CONTACT</a>
<div class="hidden-menu contact">
</div>
</div>
</td></tr>
</table>
</div>
<div class="right">
<table style="width:100%; height: 100%;">
<tr><td style="vertucal-align: middle; text-align: center">
<img src="2001_7.jpg" class="btkrsh-mask" width="600" height="500" ></img>
<div id="graph-art">
<p>BACKGROUND ARTIST</p>
<a href="http://www.facebook.com/btkrsh">SIMON C PAIGE<a>
</div>
<td></tr>
</table>
</div>
</body>
Result:
BTKRSH //
//test test
PODCASTS
RELEASES
ARTISTS
LINKS
ABOUT
CONTACT
BACKGROUND ARTIST
SIMON C PAIGE
Or, when I use the code that removes the script tags and their content, I get this:
<html>
<head>
<title>BTKRSH //</title>
<link rel="stylesheet" type="text/css" href="style.css">
</head>
<body>
<div class="left">
<table style="width: 100%; height: 100%;">
<div id="closebtn">
<a class="hidden-x"> <img src="x-gray.png"></img> </a>
</div>
<tr><td style="vertical-align: middle; text-align: center;">
<div class="menu">
<a>PODCASTS</a>
<div class="hidden-menu podcasts">
<iframe width="400" height="400" src="https://www.mixcloud.com/widget/iframe/?feed=http%3A%2F%2Fwww.mixcloud.com%2FBTKRSH%2F&embed_uuid=f78341ae-da15-480f-9604-d6812bb9a83d&replace=0&stylecolor=190303&embed_type=widget_standard" frameborder="0"></iframe><div style="clear: both; height: 3px; width: 392px;"></div><p style="display: block; font-size: 11px; font-family: 'Open Sans', Helvetica, Arial, sans-serif; margin: 0px; padding: 3px 4px; color: rgb(153, 153, 153); width: 392px;"><a href="http://www.mixcloud.com/BTKRSH/?utm_source=widget&amp;utm_medium=web&amp;utm_campaign=base_links&amp;utm_term=resource_link" target="_blank" style="color: rgb(25, 3, 3); font-weight: bold;">
</div>
<a>RELEASES</a>
<div class="hidden-menu releases">
</div>
<a>ARTISTS</a>
<div class="hidden-menu artists">
</div>
<a>LINKS</a>
<div class="hidden-menu links">
</div>
<a>ABOUT</a>
<div class="hidden-menu about">
</div>
<a>CONTACT</a>
<div class="hidden-menu contact">
</div>
</div>
</td></tr>
</table>
</div>
<div class="right">
<table style="width:100%; height: 100%;">
<tr><td style="vertucal-align: middle; text-align: center">
<img src="2001_7.jpg" class="btkrsh-mask" width="600" height="500" ></img>
<div id="graph-art">
<p>BACKGROUND ARTIST</p>
<a href="http://www.facebook.com/btkrsh">SIMON C PAIGE<a>
</div>
<td></tr>
</table>
</div>
</body>
As you can see, the script tag and its content are gone, but the HTML tags remain.
Any help is welcome, thanks!