Regex pattern does not match in PHP, but works on regex101

Asked: 2018-03-29 19:16:30

Tags: php regex

I need some help with a problem in this script. I am trying to extract some data from a website, so I wrote this pattern:

/<div class="panel-heading"><a href="(.+?)"\/><h5>(.+?)<\/h5><\/a><\/div>
  <div class="panel-body">
    <p><b> Author: <\/b> (.+?)<\/p>
    <p><b> Awarding University: <\/b>  (.+?)<\/p>
    <p><b> Level  : <\/b> (.+?)<\/p>
    <p><b> Year: <\/b> (.+?)<\/p>
    <p><b> Holding Libraries: <\/b>  (.+?)<\/p>
<p><b> Subject Terms: <\/b> (.+?)<\/p>
    <b> Abstract: <\/b>(.*?)<\/p>
  <\/div>
<\/div>/su

This works fine on regex101, but when I put it into PHP it does not return any matches:

<?php
  ini_set('memory_limit', '-1');
  $myfile = fopen("info.txt", "r") or die("Unable to open file!");
  $filedata = fread($myfile,filesize("info.txt"));
  fclose($myfile);
  $re = '/<div class="panel-heading"><a href="(.+?)"\/><h5>(.+?)<\/h5><\/a><\/div>
        <div class="panel-body">
          <p><b> Author: <\/b> (.+?)<\/p>
          <p><b> Awarding University: <\/b>  (.+?)<\/p>
          <p><b> Level  : <\/b> (.+?)<\/p>
          <p><b> Year: <\/b> (.+?)<\/p>
          <p><b> Holding Libraries: <\/b>  (.+?)<\/p>
    <p><b> Subject Terms: <\/b> (.+?)<\/p>
          <b> Abstract: <\/b>(.*?)<\/p>
        <\/div>
      <\/div>/su';
      preg_match_all($re, $filedata, $matches, PREG_SET_ORDER, 0);

      var_dump($matches);
?>

Can anyone tell me what I am doing wrong?

Here is an example of the data I am trying to extract:

<div class="panel panel-default">
  <div class="panel-heading"><a href="url"/><h5>Title</h5></a></div>
  <div class="panel-body">
    <p><b> Author: </b> author</p>
    <p><b> Awarding University: </b>  some stuff</p>
    <p><b> Level  : </b> PhD</p>
    <p><b> Year: </b> 0</p>
    <p><b> Holding Libraries: </b>  more stuff</p>
<p><b> Subject Terms: </b> other stuff</p>
    <b> Abstract: </b><p> Big text here</p>
  </div>
</div>

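2 Answers:

Answer 0 (score: 0)

The problem is the whitespace characters. I updated your regex to match all whitespace characters such as spaces, tabs, and newlines. You will notice that \s* matches any whitespace in front of the desired data, and .* (perhaps a bit sloppy ;)) matches any string in between, such as newlines, tabs, and other stuff.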

$pattern = '/<div class="panel-heading">.*<a href="(.+?)"\/><h5>\s*(.+?)\s*<\/h5><\/a><\/div>.*<div class="panel-body">.*<p><b>\s*Author\s*:.*<\/b>\s*(.+?)<\/p>.*<p><b>\s*Awarding University\s*:\s*<\/b>\s*(.+?)<\/p>.*<p><b>\s*Level\s*:\s*<\/b>\s*(.+?)<\/p>.*<p><b>\s*Year\s*:\s*<\/b>\s*(.+?)<\/p>.*<p><b>\s*Holding Libraries\s*:\s*<\/b>\s*(.+?)<\/p>.*<p><b>\s*Subject Terms\s*:\s*<\/b>\s*(.+?)<\/p>.*<b>\s*Abstract\s*:\s*<\/b>\s*(.*?)<\/p>.*<\/div>.*<\/div>/su';
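
To illustrate the whitespace point, here is a minimal sketch, assuming the mismatch comes from the pattern being indented differently in the PHP source than in the file (in the question, the pattern inside the PHP string is indented by eight or more spaces while the data is indented by two to four). The two-line sample in $html and the variable names are hypothetical, made up for this demonstration.

```php
<?php
// Hypothetical two-line sample: the second line is indented by two spaces.
$html = "<div class=\"panel-heading\">Title</div>\n  <div class=\"panel-body\">Body</div>";

// Literal whitespace: the second pattern line is indented by eight spaces, as
// it might be after pasting into indented PHP source, so it cannot match data
// that is indented by only two spaces.
$literal = '/<div class="panel-heading">(.+?)<\/div>
        <div class="panel-body">(.+?)<\/div>/su';

// Whitespace-tolerant: \s* accepts any run of spaces, tabs, or line breaks.
$tolerant = '/<div class="panel-heading">(.+?)<\/div>\s*<div class="panel-body">(.+?)<\/div>/su';

$literalCount = preg_match_all($literal, $html, $m1, PREG_SET_ORDER);
$tolerantCount = preg_match_all($tolerant, $html, $m2, PREG_SET_ORDER);

var_dump($literalCount);  // int(0)
var_dump($tolerantCount); // int(1)
```

Differences in line endings (\r\n versus \n) would cause the same kind of failure, and \s* covers that case as well.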

Answer 1 (score: 0)

Here is a working solution:

  $filedata = '<div class="panel panel-default">
  <div class="panel-heading"><a href="url"/><h5>Title</h5></a></div>
  <div class="panel-body">
    <p><b> Author: </b> author</p>
    <p><b> Awarding University: </b>  some stuff</p>
    <p><b> Level  : </b> PhD</p>
    <p><b> Year: </b> 0</p>
    <p><b> Holding Libraries: </b>  more stuff</p>
<p><b> Subject Terms: </b> other stuff</p>
    <b> Abstract: </b><p> Big text here</p>
  </div>
</div>';

  $re = '/<div class="panel-heading"><a href="(.+?)"\/><h5>(.+?)<\/h5><\/a><\/div>'
          . '\s+<div class="panel-body">'
          . '\s+<p><b> Author: <\/b> (.+?)<\/p>'
          . '\s+<p><b> Awarding University: <\/b>  (.+?)<\/p>'
          . '\s+<p><b> Level  : <\/b> (.+?)<\/p>'
          . '\s+<p><b> Year: <\/b> (.+?)<\/p>'
          . '\s+<p><b> Holding Libraries: <\/b>  (.+?)<\/p>'
          . '\s+<p><b> Subject Terms: <\/b> (.+?)<\/p>'
          . '\s+<b> Abstract: <\/b>(.*?)<\/p>/msu';
  preg_match_all($re, $filedata, $matches, PREG_SET_ORDER, 0);

  var_dump($matches);

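The problem is the whitespace characters. Line breaks and indentation written inside the pattern are matched literally, so they have to agree exactly with the file; using \s+ between the tags removes that dependency.

However, I think you should take a look at something like this: http://php.net/manual/en/class.domdocument.php, which is much more powerful. Parsing HTML with regular expressions can be tricky; it is better to use a library for that purpose.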
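
As a rough sketch of the DOMDocument route, the sample markup from the question could be read with DOMDocument and DOMXPath as shown below. The $records array, its keys, and the assumption that each value directly follows its <b> label are illustrative choices based only on the sample, not something the original page is known to guarantee.

```php
<?php
// Sample markup copied from the question; a real page would be read from a file or URL.
$html = '<div class="panel panel-default">
  <div class="panel-heading"><a href="url"/><h5>Title</h5></a></div>
  <div class="panel-body">
    <p><b> Author: </b> author</p>
    <p><b> Awarding University: </b>  some stuff</p>
    <p><b> Level  : </b> PhD</p>
    <p><b> Year: </b> 0</p>
    <p><b> Holding Libraries: </b>  more stuff</p>
<p><b> Subject Terms: </b> other stuff</p>
    <b> Abstract: </b><p> Big text here</p>
  </div>
</div>';

libxml_use_internal_errors(true); // collect parser warnings about the imperfect markup silently
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$records = [];

foreach ($xpath->query('//div[@class="panel-heading"]') as $heading) {
    $record = [
        'url'   => $xpath->evaluate('string(.//a/@href)', $heading),
        'title' => trim($xpath->evaluate('string(.//h5)', $heading)),
    ];

    // Each <b> holds a label; assume its value is the node that directly follows it
    // (a text node for most fields, a <p> element for the abstract).
    $labels = $xpath->query('following-sibling::div[@class="panel-body"][1]//b', $heading);
    foreach ($labels as $b) {
        $label = trim($b->textContent, " \t\r\n:");
        $value = $b->nextSibling !== null ? trim($b->nextSibling->textContent) : '';
        $record[$label] = $value;
    }

    $records[] = $record;
}

var_dump($records);
```

Because the lookup is driven by element structure rather than by exact characters, changes in indentation or line endings do not affect it.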