How to get data from HTML using a regular expression

Time: 2015-08-20 12:31:18

Tags: php html regex html-parsing

I have the following HTML:

<table class="profile-stats">
  <tr>
    <td class="stat">
      <div class="statnum">8</div>
      <div class="statlabel"> Tweets </div>
    </td>
    <td class="stat">
        <a href="/THEDJMHA/following">
          <div class="statnum">13</div>
          <div class="statlabel"> Following </div>
        </a>
    </td>
    <td class="stat stat-last">
        <a href="/THEDJMHA/followers">
          <div class="statnum">22</div>
          <div class="statlabel"> Followers </div>
        </a>
    </td>
  </tr>
</table>

I want to get the value from <td class="stat stat-last"> => <div class="statnum">, which is 22.

I have tried the following regular expression, but it finds no match.

/<div\sclass="statnum">^(.)\?<\/div>/ig

4 answers:

Answer 0 (score: 3)

Here is one way to do this using a parser.

<?php
$html = '<table class="profile-stats">
  <tr>
    <td class="stat">
      <div class="statnum">8</div>
      <div class="statlabel"> Tweets </div>
    </td>
    <td class="stat">
        <a href="/THEDJMHA/following">
          <div class="statnum">13</div>
          <div class="statlabel"> Following </div>
        </a>
    </td>
    <td class="stat stat-last">
        <a href="/THEDJMHA/followers">
          <div class="statnum">22</div>
          <div class="statlabel"> Followers </div>
        </a>
    </td>
  </tr>
</table>';
$doc = new DOMDocument(); //make a dom object
$doc->loadHTML($html);
$tds = $doc->getElementsByTagName('td');
foreach ($tds as $cell) { //loop through all Cells
    // strpos() returns 0 (falsy) when the match is at the start, so compare against false
    if (strpos($cell->getAttribute('class'), 'stat-last') !== false) {
        $divs = $cell->getElementsByTagName('div');
        foreach($divs as $div) { // loop through all divs of the cell
            if($div->getAttribute('class') == 'statnum'){
                echo $div->nodeValue;
            }
        }
    }
}

Output:

22

...or using XPath...

$doc = new DOMDocument(); //make a dom object
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$statnums = $xpath->query("//td[@class='stat stat-last']/a/div[@class='statnum']");
foreach($statnums as $statnum) {
    echo $statnum->nodeValue;
}

Output:

22
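
The XPath query above compares the whole class attribute to the exact string 'stat stat-last', so it would stop matching if the classes were reordered or another class were added. A minimal sketch of matching individual class tokens instead is shown here. The helper name statnumForCellClass is hypothetical (it is not part of the original answer), the token is assumed to contain no quote characters, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper (not from the original answer): returns the text of the
// "statnum" div inside the first cell whose class list contains $token.
function statnumForCellClass(string $html, string $token): ?string
{
    $doc = new DOMDocument();
    // Collect libxml warnings quietly; real-world HTML often triggers them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Pad the class attribute with spaces so the token matches as a whole word.
    $query = "//td[contains(concat(' ', normalize-space(@class), ' '), ' $token ')]"
           . "//div[contains(concat(' ', normalize-space(@class), ' '), ' statnum ')]";
    $nodes = $xpath->query($query);

    return $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;
}

$html = '<table class="profile-stats">
  <tr>
    <td class="stat">
      <div class="statnum">8</div>
    </td>
    <td class="stat stat-last">
      <a href="/THEDJMHA/followers">
        <div class="statnum">22</div>
      </a>
    </td>
  </tr>
</table>';

$followers = statnumForCellClass($html, 'stat-last');
echo $followers, "\n";
```

Using // before div also means the query no longer depends on the div being wrapped in exactly one <a> element.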

Or, if you really want to do it with a regex...

<?php
$html = '<table class="profile-stats">
  <tr>
    <td class="stat">
      <div class="statnum">8</div>
      <div class="statlabel"> Tweets </div>
    </td>
    <td class="stat">
        <a href="/THEDJMHA/following">
          <div class="statnum">13</div>
          <div class="statlabel"> Following </div>
        </a>
    </td>
    <td class="stat stat-last">
        <a href="/THEDJMHA/followers">
          <div class="statnum">22</div>
          <div class="statlabel"> Followers </div>
        </a>
    </td>
  </tr>
</table>';
preg_match('~td class=".*?stat-last">.*?<div class="statnum">(.*?)<~s', $html, $num);
echo $num[1];

Output:

22

Regex demo: https://regex101.com/r/kM6kI2/1

Answer 1 (score: 2)

I think you would be better off using an XML parser rather than a regex. SimpleXML can do the job for you: http://php.net/manual/en/book.simplexml.php
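
A minimal sketch of what the SimpleXML approach might look like for the markup in the question. It assumes the input is well-formed XML, which happens to be true for this snippet but often is not for real-world HTML. The variable names are illustrative only.

```php
<?php
$html = '<table class="profile-stats">
  <tr>
    <td class="stat">
      <div class="statnum">8</div>
      <div class="statlabel"> Tweets </div>
    </td>
    <td class="stat stat-last">
      <a href="/THEDJMHA/followers">
        <div class="statnum">22</div>
        <div class="statlabel"> Followers </div>
      </a>
    </td>
  </tr>
</table>';

// simplexml_load_string() returns false if the markup is not well-formed XML.
$xml = simplexml_load_string($html);

// xpath() returns an array of SimpleXMLElement objects for the matching nodes.
$nodes = $xml !== false
    ? $xml->xpath("//td[@class='stat stat-last']//div[@class='statnum']")
    : [];

// Cast the first match to a string to read its text content.
$followers = !empty($nodes) ? trim((string) $nodes[0]) : null;
echo $followers, "\n";
```

If the page is not valid XML, loading it with DOMDocument::loadHTML() first is usually the more forgiving route.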

Answer 2 (score: 2)

/<td class="stat stat-last">.*?<div class="statnum">(\d+)/si

Your match is in the first capture group. Note the s option at the end, which makes '.' match newlines.
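
As a rough illustration of how this pattern could be used from PHP, the sketch below passes it to preg_match() and reads capture group 1. The variable names are assumptions, and the pattern only works while the class attribute is written exactly as 'stat stat-last'.

```php
<?php
$html = '<table class="profile-stats">
  <tr>
    <td class="stat">
      <div class="statnum">8</div>
      <div class="statlabel"> Tweets </div>
    </td>
    <td class="stat stat-last">
      <a href="/THEDJMHA/followers">
        <div class="statnum">22</div>
        <div class="statlabel"> Followers </div>
      </a>
    </td>
  </tr>
</table>';

// s: '.' also matches newlines; i: case-insensitive matching.
$pattern = '/<td class="stat stat-last">.*?<div class="statnum">(\d+)/si';

// preg_match() returns 1 on a match; group 1 holds the digits.
$followers = preg_match($pattern, $html, $m) === 1 ? (int) $m[1] : null;
echo $followers, "\n";
```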

Answer 3 (score: 1)

You can edit your pattern like this:

/<div\sclass="statnum">(.*?)<\/div>/ig
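
Since the question is tagged php, note that PHP's preg_* functions do not accept a g modifier; preg_match_all() is used to collect every match instead. The sketch below applies the pattern that way, with the g dropped. Because the pattern matches every statnum div, it assumes the value wanted is the last one in the table, which is an assumption about the page layout rather than something the pattern guarantees.

```php
<?php
$html = '<table class="profile-stats">
  <tr>
    <td class="stat">
      <div class="statnum">8</div>
    </td>
    <td class="stat">
      <a href="/THEDJMHA/following">
        <div class="statnum">13</div>
      </a>
    </td>
    <td class="stat stat-last">
      <a href="/THEDJMHA/followers">
        <div class="statnum">22</div>
      </a>
    </td>
  </tr>
</table>';

// Same pattern as above, minus the g flag that PHP does not support.
$pattern = '/<div\sclass="statnum">(.*?)<\/div>/i';

// $matches[1] collects capture group 1 from every match, in document order.
preg_match_all($pattern, $html, $matches);
$values = $matches[1];

// Assumption: the stat-last cell is the last one, so take the final value.
$followers = end($values);
echo implode(', ', $values), "\n";
echo $followers, "\n";
```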