Question

<div class="plot_summary minPlotHeightWithPoster">
            <div class="summary_text" itemprop="description">
                    King Leonidas of Sparta and a force of 300 men fight the Persians at Thermopylae in 480 B.C.
            </div>

I want to extract the text between the two div tags. I am new to sed and awk, so I can't figure out how to do this. I tried using grep, but it didn't work.

Answer 1

As Sundeep points out in a comment: it is best to use a proper HTML parser.

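The standard utilities are mostly line-based and at best regex-based; they cannot robustly parse HTML, with all its variability in quoting styles and whitespace, let alone recognize the actual syntax.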

GNU grep offers more flexibility than other implementations: multi-line matching (-z) and support for PCRE (-P), which enables look-around assertions.

A GNU grep command that combines -z and -P can handle your sample input, but it is still far from a robust parsing solution.
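As a rough illustration of the multi-line and PCRE options, the sketch below runs GNU grep on a copy of the question's sample markup. The pattern and the temporary file are assumptions made for this example, not the command from the original answer, and it needs a GNU grep built with PCRE support.

```shell
# Sketch only: the pattern and temp file are assumptions for illustration.
html=$(mktemp)
cat > "$html" <<'EOF'
<div class="plot_summary minPlotHeightWithPoster">
            <div class="summary_text" itemprop="description">
                    King Leonidas of Sparta and a force of 300 men fight the Persians at Thermopylae in 480 B.C.
            </div>
EOF

# -z: treat the whole file as one record so the match can span lines
# -o: print only the matched part
# -P: PCRE, needed for \K and the look-ahead
# \K discards what was matched so far; (?=...) stops before the closing tag.
# With -z the output is NUL-terminated, so tr turns the NUL into a newline.
result=$(grep -zoP '<div class="summary_text"[^>]*>\s*\K[^<]*?(?=\s*</div>)' "$html" | tr '\0' '\n')
printf '%s\n' "$result"
rm -f "$html"
```

Any change in attribute order, quoting style, or nested markup inside the div would break this pattern, which is why a real parser is preferable.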

Answer 2

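The recommended way to parse XML or HTML in a Unix or Unix-like terminal:

If you are looking for a way to do this from the Unix command line, I suggest first considering an XML parsing tool rather than awk, grep, or sed.

For example, your system may have xmllint. If your HTML is in a file named index.html, the following xmllint command extracts the text: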

xmllint --html --xpath "//div[contains(@class, 'plot_summary')]/div[contains(@class, 'summary_text')]/text()" index.html

The text needs to be trimmed after that command, so you will probably pipe it through another command to do so:

(xpath="//div[contains(@class, 'plot_summary')]/div[contains(@class, 'summary_text')]/text()" && \
xmllint --html --xpath "$xpath" index.html) \
| sed -e 's/^[[:space:]]*//' -e '/^[[:space:]]*$/d'

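The sed command that receives the output has two expressions. The first, 's/^[[:space:]]*//', removes whitespace at the beginning of each line; the second, '/^[[:space:]]*$/d', deletes any line that consists only of whitespace.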
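To see the two expressions at work without xmllint, the sketch below feeds sed some hand-written input that imitates the indented, blank-line-padded text xmllint returns. The simulated input is an assumption made for this example.

```shell
# Simulated parser output (an assumption): one blank line, one indented
# line of text, and one line containing only spaces.
trimmed=$(printf '\n   \t King Leonidas of Sparta\n      \n' \
  | sed -e 's/^[[:space:]]*//' -e '/^[[:space:]]*$/d')

# Only the text line survives, with its leading whitespace removed.
printf '%s\n' "$trimmed"
```

The order matters little here: a whitespace-only line is emptied by the first expression and then deleted by the second.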

You can look into other XML command-line parsing tools (see the accepted answer): How to execute XPath one-liners from shell?

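The awful way, using sed:

You can work around sed's line-by-line processing by using echo to collapse the file into a single line. Then, with a sed substitution, you can extract the desired text. This is not a good approach because it depends heavily on the formatting of the input: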

(set -o noglob; echo $(cat index.html)) \
| sed 's/.*<div[^>]*class[^=]*=[^"]*"summary_text"[^>]*>[[:space:]]*\([^<]*\).*/\1/'

Updated to disable globbing with the set command, per mklement0's comment.
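A small sketch of why globbing has to be disabled: an unquoted command substitution is split into words and each word is subject to pathname expansion, so a literal * in the file would be replaced by file names. The directory and the file name sample.txt are hypothetical, chosen only for this example.

```shell
# Hypothetical scratch directory holding a single two-line file.
work=$(mktemp -d)
printf 'one   *\n   two\n' > "$work/sample.txt"

# Without noglob: word splitting collapses the whitespace and newlines,
# but the unquoted * also expands to the file names in the directory.
globbed=$(cd "$work" && echo $(cat sample.txt))

# With noglob: the text is still flattened to one line, and * stays literal.
# The command substitution runs in a subshell, so the setting does not leak.
flat=$(cd "$work" && set -o noglob && echo $(cat sample.txt))

printf '%s\n' "$globbed" "$flat"
rm -rf "$work"
```

Running the option inside a parenthesized subshell, as the answer's command does, keeps noglob from affecting the rest of the session.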

Extracting text between two anchor tags using sed, grep, or awk
