Question

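This feels like it should be a simple task, but somehow I can't wrap my head around it. I have HTML files with H1-H4 headings, and I'd like to get the content between H3 tags. Not the text between <H3> and </H3>, but the text between one H3 heading and the next.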

<H3>some text</H3>
<p> more text that I would like to grab</p>
<H3> some other text </H3>
<p> some more text that I'd like to get </p>

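... Thanks in advance.

I've been asked to describe a sample output, which I thought I had done in the comments below. I'll restate it; please let me know if it's still unclear.

Input: one long file containing many H3 headings.

Output: many small files, each containing one fragment that starts on the line containing an H3 heading and ends on the line before the next H3 heading.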

Answer 1

Since you haven't posted the expected output we're just guessing, but if you really do want the text between </H3> and <H3>, here's one way with GNU awk:

$ cat file
<H3>some text</H3>
<p> more text that I would like to grab</p>
<H3> some other text </H3>
<p> some more text that I'd like to get </p>
$ gawk -F'</H3>' -v RS="<H3>" -v ORS= 'NR>1{print $NF}' file

<p> more text that I would like to grab</p>

<p> some more text that I'd like to get </p>
$

$ cat file
<H3>some text</H3><p>more text that I would like to grab</p><H3>some other text</H3><p> some more text that I'd like to get </p>
$ gawk -F'</H3>' -v RS="<H3>" -v ORS= 'NR>1{print $NF}' file
<p>more text that I would like to grab</p><p> some more text that I'd like to get </p>

$ gawk -F'</H3>' -v RS="<H3>" 'NR>1{print $NF}' file
<p>more text that I would like to grab</p>
<p> some more text that I'd like to get </p>

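You need GNU awk for the multi-character RS; POSIX leaves the behavior unspecified when RS is longer than one character.

Note that when the text between your blocks contains newlines, those newlines are reproduced in the output just like any other character.

If the above isn't what you want then, again, tell us more.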

Answer 2

The problem is that HTML syntax is very flexible. For example:

<H3>some text</H3>
<p> more text that I would like to grab</p>
<H3> some other text </H3>
<p> some more text that I'd like to get </p>

and

<H3>
    some text
</H3>
<p> 
    more
    text
    that
    I
    would
    like
    to
    grab</p>
<H3> 
  some other text
        </H3>
<p>some        more     text that I'd        like to get
</p>

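will produce the same rendered output. Extra whitespace is stripped, and the tags can be spread out over any number of lines. You can't simply search for a particular tag on a line to find what you're after.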
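To make that concrete, here is a small sketch; the file names compact.html and spread.html are made up for the demo. It runs the same line-oriented pattern against two equivalent fragments and counts how many lines hold a complete <H3>...</H3> element.

```shell
# Sketch only: compact.html and spread.html are hypothetical demo files.
cat > compact.html <<'EOF'
<H3>some text</H3>
<p> more text that I would like to grab</p>
EOF

cat > spread.html <<'EOF'
<H3>
    some text
</H3>
<p>
    more text that I would like to grab</p>
EOF

# Count the lines that contain a whole <H3>...</H3> element.
# grep -c exits non-zero when the count is 0, hence the "|| true".
compact_hits=$(grep -c '<H3>.*</H3>' compact.html || true)
spread_hits=$(grep -c '<H3>.*</H3>' spread.html || true)

echo "compact: $compact_hits, spread: $spread_hits"
# prints: compact: 1, spread: 0
```

Both files describe the same heading, yet the pattern only finds it in the first one.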

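The only reliable way is to use a full scripting language such as Perl or Python, which have modules that can parse an HTML file and organize its contents for you. You can't parse HTML or XML with Unix regular expressions.

Unfortunately, you've tagged this as bash, shell, and awk, and none of these can handle HTML input in a clean way.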

Answer 3

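To start with, this shell line will extract the first H3-to-H3 section. It assumes the first H3 heading is not on line 1 of the input: with a range that starts at line 1, sed only begins looking for the closing pattern on line 2, so it would skip ahead to the second heading.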

$ sed -e '1,/<H3/d' -e '/<H3/,$d'
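To go from one section to the "many small files" the question asks for, something along these lines might work. It is only a sketch under a few assumptions: every H3 heading starts on its own line (as in the question's sample), the input is called input.html, and the output names section_NNN.html are invented for the example. It uses plain POSIX awk features, so GNU awk should not be required. Run it in a scratch directory, since it writes files into the current one.

```shell
# Sketch only: input.html and the section_NNN.html names are assumptions.
cat > input.html <<'EOF'
<H1>title</H1>
<H3>some text</H3>
<p> more text that I would like to grab</p>
<H3> some other text </H3>
<p> some more text that I'd like to get </p>
EOF

# Start a new output file at every line containing an H3 opening tag.
# Lines before the first H3 are skipped because "out" is still empty.
awk '
    /<[Hh]3[ >]/ {
        if (out != "") close(out)                 # finish the previous section
        out = sprintf("section_%03d.html", ++n)   # name of the next section file
    }
    out != "" { print > out }
' input.html

ls section_*.html
```

Each resulting file starts with its H3 line and ends on the line before the next H3. Like the sed line above, this is line-oriented, so it inherits the same limits with headings that are split across lines or share a line with other content.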

Chunking a file using awk or a shell script
