How to remove HTML tags using sed

Date: 2014-05-08 17:05:43

Tags: sed

The input is:

<h1>This is heading 1</h1>
<h2>This is heading 2</h2>
<h3>This is heading 3</h3>
<h4>This is heading 4</h4>
<h5>This is heading 5</h5>
<h6>This is heading 6</h6>

</body>
</html>

Expected output:

This is heading 1
This is heading 2
This is heading 3
This is heading 4
This is heading 5
This is heading 6

I tried sed -n 's/<[^>].*>//gp' example.html, but nothing appears on the screen; it seems the regular expression is incorrect.

3 answers:

Answer 0 (score: 0)

If your version of grep supports the -P option (PCRE), this should be enough:

$ grep -oP '(?<=>)(.[^<]+)(?=<)' file
This is heading 1
This is heading 2
This is heading 3
This is heading 4
This is heading 5
This is heading 6

Answer 1 (score: -1)

sed -n 's/<[^>]*>//gp' test.csv | sed '/^$/d'

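You were almost there. The dot (.) you used can also match the ">" character, so .* runs through to the last ">" on the line and the text is deleted along with the tags. Remove the dot from the command.

The command after the pipe removes all empty lines.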
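
To see the difference concretely, the following sketch runs both patterns on one line from the question and then folds the blank-line cleanup into a single sed call. The strip_tags function name is hypothetical and not part of the original answer.

```shell
#!/bin/sh
# Sample line taken from the question's input.
line='<h1>This is heading 1</h1>'

# Original attempt: ".*" is greedy and runs to the last ">" on the line,
# so the whole line is deleted and only an empty line is printed.
greedy=$(printf '%s\n' "$line" | sed -n 's/<[^>].*>//gp')

# Corrected pattern: "[^>]*" cannot cross a ">", so each tag is removed
# separately and the text between the tags survives.
fixed=$(printf '%s\n' "$line" | sed -n 's/<[^>]*>//gp')

# Hypothetical helper: strip tags and drop empty lines in one sed process.
strip_tags() {
    sed -e 's/<[^>]*>//g' -e '/^$/d'
}

printf 'greedy: [%s]\n' "$greedy"
printf 'fixed:  [%s]\n' "$fixed"
```

Note that strip_tags does not use -n, so unlike the two-command pipeline it also prints non-empty lines that contained no tags at all. For the sample input in the question the result should be the same.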

Answer 2 (score: -1)

For your sample:

sed -n 's|</\{0,1\}h[0-9]>||gp' YourFile

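This removes any opening or closing heading tag (<h followed by one digit, or </h followed by one digit) in the line, and prints the line only if something was changed.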
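
A quick way to check this behaviour is to feed the command a few lines on standard input instead of YourFile. The sample lines below are made up for illustration.

```shell
#!/bin/sh
# Made-up sample: a heading, a paragraph, and a closing body tag.
sample='<h1>This is heading 1</h1>
<p>Not a heading</p>
</body>'

# Only <hN> and </hN> (N = one digit) are removed. With -n and the p flag,
# lines where nothing was substituted are not printed at all.
headings=$(printf '%s\n' "$sample" | sed -n 's|</\{0,1\}h[0-9]>||gp')

printf '%s\n' "$headings"
```

Only the heading line is printed; the paragraph and the closing body tag are dropped because no substitution happened on those lines.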

More precise (assuming the tag ...):

sed -n 's|^[[:space:]]*<\(h[0-9]>\)\(.*\)</\1|\2|p' YourFile
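
As a rough check of the back-reference version, the sketch below runs it on two made-up lines: an indented heading, and a line whose closing tag does not match its opening tag.

```shell
#!/bin/sh
# Made-up sample: an indented heading, and a mismatched open/close pair.
sample='   <h2>This is heading 2</h2>
<h3>Mismatched</h4>'

# \1 holds "h2>" captured from the opening tag, so "</\1" only matches "</h2>".
# The mismatched line is not substituted and therefore not printed.
matched=$(printf '%s\n' "$sample" | sed -n 's|^[[:space:]]*<\(h[0-9]>\)\(.*\)</\1|\2|p')

printf '%s\n' "$matched"
```

The leading whitespace is consumed by the pattern as well, so only the heading text remains.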