Question

I have gone through the post "why not use regular expression for HTML". As part of a task given to me, I have no choice but to use regular expressions on HTML.

I have the following HTML and, taken on its own,

 <td class="a-nowrap">

          <span class="a-letter-space"></span><span>13</span>

        </td>

I am able to get 13 from it using the following regex:

<td class="a-nowrap">\s*<span class="a-letter-space"></span><span>(\d*)</span>\s*</td>

and similarly, from

<td class="a-nowrap">

          <a class="a-link-normal" title="69% of reviews have 5 stars" href="">5 star</a><span class="a-letter-space"></span>          

        </td>

I get 5 star using the regex

<td class="a-nowrap">\s*<a class="a-link-normal" [^>]*>\s*(.*)</a>\s*</td>

But when the two HTML snippets are combined, as in

<table id="histogramTable" class="a-normal a-align-middle a-spacing-base">

  <tr class="a-histogram-row">



        <td class="a-nowrap">

          <a class="a-link-normal" title="69% of reviews have 5 stars" href="">5 star</a><span class="a-letter-space"></span>          

        </td>

        <td class="a-span10">

          <a class="a-link-normal" title="69% of reviews have 5 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 69.1358024691358%;"></div></div></a>

        </td>

        <td class="a-nowrap">

          <span class="a-letter-space"></span><span>13</span>

        </td>

  </tr>
  <td class="a-nowrap">

      <a class="a-link-normal" title="2% of reviews have 1 stars" href="">1 star</a><span class="a-letter-space"></span>          

    </td>

    <td class="a-span10">

      <a class="a-link-normal" title="2% of reviews have 1 stars" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 2.46913580246914%;"></div></div></a>

    </td>

    <td class="a-nowrap">

      <span class="a-letter-space"></span><span>2</span>

    </td>


</table>

how can I extract both 5 star and 13 using a regex?

Answer 1

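If you don't want to use an HTML parser, either apply the regexes one after another or join the two patterns with .* between them. I have modified the star regex a bit, because it wasn't working properly.

First enable the dotall flag so that . also matches newlines, then use: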

<td class="a-nowrap">\s*<a class="a-link-normal" [^>]*>\s*(\d star).*<td class="a-nowrap">\s*<span class="a-letter-space"></span><span>(\d*)</span>\s*</td>

Output:

Group 1: 5 star

Group 2: 13
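As a rough Python sketch of the combined pattern (the HTML below is condensed from the question, and the names HTML, LABEL and COUNT are only illustrative), note that a greedy .* runs to the last count cell once the input holds more than one row, so a lazy .*? is probably what you want if both rows are present:

```python
import re

# Two rows condensed from the question's table (some attributes shortened).
HTML = """
<table id="histogramTable">
  <tr class="a-histogram-row">
    <td class="a-nowrap">
      <a class="a-link-normal" title="69% of reviews have 5 stars" href="">5 star</a><span class="a-letter-space"></span>
    </td>
    <td class="a-span10">
      <a class="a-link-normal" href=""><div class="a-meter"></div></a>
    </td>
    <td class="a-nowrap">
      <span class="a-letter-space"></span><span>13</span>
    </td>
  </tr>
  <tr class="a-histogram-row">
    <td class="a-nowrap">
      <a class="a-link-normal" title="2% of reviews have 1 stars" href="">1 star</a><span class="a-letter-space"></span>
    </td>
    <td class="a-span10">
      <a class="a-link-normal" href=""><div class="a-meter"></div></a>
    </td>
    <td class="a-nowrap">
      <span class="a-letter-space"></span><span>2</span>
    </td>
  </tr>
</table>
"""

# The two halves of the combined pattern from the answer.
LABEL = r'<td class="a-nowrap">\s*<a class="a-link-normal" [^>]*>\s*(\d star)'
COUNT = r'<td class="a-nowrap">\s*<span class="a-letter-space"></span><span>(\d*)</span>\s*</td>'

# re.DOTALL lets "." cross line breaks between the two cells.
greedy = re.compile(LABEL + r'.*' + COUNT, re.DOTALL)
lazy = re.compile(LABEL + r'.*?' + COUNT, re.DOTALL)

# Greedy: first label paired with the LAST count cell in the input.
greedy_groups = greedy.search(HTML).groups()

# Lazy: each label paired with the nearest following count cell.
lazy_pairs = lazy.findall(HTML)

print(greedy_groups)
print(lazy_pairs)
```

With a single row in the input, the greedy and lazy versions give the same groups; they only diverge when several rows are matched at once.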

Edit

I made a shorter regex:

REGEX:

>(\d star)<.+?>(\d+?)<

Tested on pythonregex.com against the edited input you provided:

Output:

>>> regex.findall(string)
[(u'5 star', u'13'), (u'1 star', u'2')]
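For anyone who wants to try the shorter regex locally, here is a minimal Python 3 sketch. The sample HTML is condensed from the question, and the variable names are just placeholders. It assumes the dotall flag is still enabled, since the label and the count sit on different lines:

```python
import re

# Two rows condensed from the question's table, keeping the width styles
# to show that decimal numbers inside attributes are not picked up.
sample = """
<tr class="a-histogram-row">
  <td class="a-nowrap">
    <a class="a-link-normal" title="69% of reviews have 5 stars" href="">5 star</a><span class="a-letter-space"></span>
  </td>
  <td class="a-span10">
    <a class="a-link-normal" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 69.1358024691358%;"></div></div></a>
  </td>
  <td class="a-nowrap">
    <span class="a-letter-space"></span><span>13</span>
  </td>
</tr>
<tr class="a-histogram-row">
  <td class="a-nowrap">
    <a class="a-link-normal" title="2% of reviews have 1 stars" href="">1 star</a><span class="a-letter-space"></span>
  </td>
  <td class="a-span10">
    <a class="a-link-normal" href=""><div class="a-meter"><div class="a-meter-bar" style="width: 2.46913580246914%;"></div></div></a>
  </td>
  <td class="a-nowrap">
    <span class="a-letter-space"></span><span>2</span>
  </td>
</tr>
"""

SHORT = r'>(\d star)<.+?>(\d+?)<'

# With DOTALL, ".+?" may span the line breaks between the two cells.
pairs = re.findall(SHORT, sample, re.DOTALL)

# Without DOTALL, "." stops at each newline, so nothing matches here.
pairs_no_dotall = re.findall(SHORT, sample)

print(pairs)
print(pairs_no_dotall)
```

In Python 3 the results are plain str tuples, so the u prefixes from the output above do not appear.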

Parse HTML table row using regex