I need some help with this script. I'm trying to extract some data from a website, so I wrote this pattern:
/<div class="panel-heading"><a href="(.+?)"\/><h5>(.+?)<\/h5><\/a><\/div>
<div class="panel-body">
<p><b> Author: <\/b> (.+?)<\/p>
<p><b> Awarding University: <\/b> (.+?)<\/p>
<p><b> Level : <\/b> (.+?)<\/p>
<p><b> Year: <\/b> (.+?)<\/p>
<p><b> Holding Libraries: <\/b> (.+?)<\/p>
<p><b> Subject Terms: <\/b> (.+?)<\/p>
<b> Abstract: <\/b>(.*?)<\/p>
<\/div>
<\/div>/su
This works fine on regex101, but when I put it in PHP it returns no matches:
<?php
ini_set('memory_limit', '-1');
$myfile = fopen("info.txt", "r") or die("Unable to open file!");
$filedata = fread($myfile,filesize("info.txt"));
fclose($myfile);
$re = '/<div class="panel-heading"><a href="(.+?)"\/><h5>(.+?)<\/h5><\/a><\/div>
<div class="panel-body">
<p><b> Author: <\/b> (.+?)<\/p>
<p><b> Awarding University: <\/b> (.+?)<\/p>
<p><b> Level : <\/b> (.+?)<\/p>
<p><b> Year: <\/b> (.+?)<\/p>
<p><b> Holding Libraries: <\/b> (.+?)<\/p>
<p><b> Subject Terms: <\/b> (.+?)<\/p>
<b> Abstract: <\/b>(.*?)<\/p>
<\/div>
<\/div>/su';
preg_match_all($re, $filedata, $matches, PREG_SET_ORDER, 0);
var_dump($matches);
?>
Can anyone tell me what I'm doing wrong?
Here is an example of the data I want to extract:
<div class="panel panel-default">
<div class="panel-heading"><a href="url"/><h5>Title</h5></a></div>
<div class="panel-body">
<p><b> Author: </b> author</p>
<p><b> Awarding University: </b> some stuff</p>
<p><b> Level : </b> PhD</p>
<p><b> Year: </b> 0</p>
<p><b> Holding Libraries: </b> more stuff</p>
<p><b> Subject Terms: </b> other stuff</p>
<b> Abstract: </b><p> Big text here</p>
</div>
</div>
Answer 0 (score: 0)
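The problem is whitespace. I updated your regex to match all whitespace characters, such as spaces, tabs, and newlines. You'll notice that \s* matches any whitespace in front of the data you want, and .* (maybe a bit sloppy ;)) matches any string in between, such as newlines, tags, and other stuff.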
$pattern = '/<div class="panel-heading">.*<a href="(.+?)"\/><h5>\s*(.+?)\s*<\/h5><\/a><\/div>.*<div class="panel-body">.*<p><b>\s*Author\s*:.*<\/b>\s*(.+?)<\/p>.*<p><b>\s*Awarding University\s*:\s*<\/b>\s*(.+?)<\/p>.*<p><b>\s*Level\s*:\s*<\/b>\s*(.+?)<\/p>.*<p><b>\s*Year\s*:\s*<\/b>\s*(.+?)<\/p>.*<p><b>\s*Holding Libraries\s*:\s*<\/b>\s*(.+?)<\/p>.*<p><b>\s*Subject Terms\s*:\s*<\/b>\s*(.+?)<\/p>.*<b>\s*Abstract\s*:\s*<\/b>\s*(.*?)<\/p>.*<\/div>.*<\/div>/su';
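The "sloppy" remark about .* matters once the input contains more than one record: with the s flag, a greedy .* can run across record boundaries. The following is a minimal sketch of that effect, not part of the original answer; the two-record input and the shortened tag structure are assumptions made for illustration.

```php
<?php
// Two minimal records; the tags are simplified stand-ins for the real markup.
$data = "<h5>A</h5>\n<p>a</p>\n<h5>B</h5>\n<p>b</p>";

// Greedy .* with the s flag runs to the end of the input and backtracks
// only as far as needed, so it bridges the two records.
preg_match_all('/<h5>(.+?)<\/h5>.*<p>(.+?)<\/p>/s', $data, $greedy, PREG_SET_ORDER);

// Lazy .*? stops at the first position where the rest of the pattern matches.
preg_match_all('/<h5>(.+?)<\/h5>.*?<p>(.+?)<\/p>/s', $data, $lazy, PREG_SET_ORDER);

echo count($greedy), "\n"; // 1 (title "A" paired with paragraph "b")
echo count($lazy), "\n";   // 2 (one match per record)
```

If the real file holds many panels, replacing each .* in the pattern above with .*? or \s* is likely the safer choice.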
Answer 1 (score: 0)
Here is a working solution:
$filedata = '<div class="panel panel-default">
<div class="panel-heading"><a href="url"/><h5>Title</h5></a></div>
<div class="panel-body">
<p><b> Author: </b> author</p>
<p><b> Awarding University: </b> some stuff</p>
<p><b> Level : </b> PhD</p>
<p><b> Year: </b> 0</p>
<p><b> Holding Libraries: </b> more stuff</p>
<p><b> Subject Terms: </b> other stuff</p>
<b> Abstract: </b><p> Big text here</p>
</div>
</div>';
$re = '/<div class="panel-heading"><a href="(.+?)"\/><h5>(.+?)<\/h5><\/a><\/div>'
. '\s+<div class="panel-body">'
. '\s+<p><b> Author: <\/b> (.+?)<\/p>'
. '\s+<p><b> Awarding University: <\/b> (.+?)<\/p>'
. '\s+<p><b> Level : <\/b> (.+?)<\/p>'
. '\s+<p><b> Year: <\/b> (.+?)<\/p>'
. '\s+<p><b> Holding Libraries: <\/b> (.+?)<\/p>'
. '\s+<p><b> Subject Terms: <\/b> (.+?)<\/p>'
. '\s+<b> Abstract: <\/b>(.*?)<\/p>/msu';
preg_match_all($re, $filedata, $matches, PREG_SET_ORDER, 0);
var_dump($matches);
The problem is the whitespace characters.
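One plausible way whitespace breaks the original pattern is a line-ending mismatch: a pattern typed across several lines contains literal newlines, so it only matches input with exactly the same line breaks. This is an assumption, since the contents of info.txt are not shown. A minimal sketch using a shortened version of the pattern:

```php
<?php
// Pattern with a literal LF between the tags, as in a multi-line pattern
// pasted into a PHP source file saved with LF line endings.
$literal = "/<div class=\"panel-body\">\n<p><b> Author: <\\/b> (.+?)<\\/p>/su";

// Same pattern with \s* wherever whitespace may vary.
$flexible = '/<div class="panel-body">\s*<p><b>\s*Author\s*:\s*<\/b>\s*(.+?)<\/p>/su';

$lf   = "<div class=\"panel-body\">\n<p><b> Author: </b> author</p>";
$crlf = str_replace("\n", "\r\n", $lf); // same data with Windows line endings

$literalOnLf    = preg_match($literal, $lf);
$literalOnCrlf  = preg_match($literal, $crlf);
$flexibleOnCrlf = preg_match($flexible, $crlf);

var_dump($literalOnLf);    // int(1)
var_dump($literalOnCrlf);  // int(0)
var_dump($flexibleOnCrlf); // int(1)
```

Leading indentation or tabs in the downloaded HTML would cause the same kind of failure, which \s* or \s+ between the tags also absorbs.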
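But I think you should take a look at something like this: http://php.net/manual/en/class.domdocument.php. It is much more powerful. Parsing HTML with regular expressions can be tricky, so use a library for this purpose.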
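As a rough sketch of the DOMDocument route, assuming every record is wrapped in a div with class panel-default as in the sample from the question (the $records array and its keys are names chosen here for illustration):

```php
<?php
// Sample markup copied from the question.
$html = <<<'HTML'
<div class="panel panel-default">
<div class="panel-heading"><a href="url"/><h5>Title</h5></a></div>
<div class="panel-body">
<p><b> Author: </b> author</p>
<p><b> Awarding University: </b> some stuff</p>
<p><b> Level : </b> PhD</p>
<p><b> Year: </b> 0</p>
<p><b> Holding Libraries: </b> more stuff</p>
<p><b> Subject Terms: </b> other stuff</p>
<b> Abstract: </b><p> Big text here</p>
</div>
</div>
HTML;

// The markup is not strictly valid HTML, so collect parser warnings silently.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$records = [];
foreach ($xpath->query('//div[contains(@class, "panel-default")]') as $panel) {
    $heading = './/div[contains(@class, "panel-heading")]';
    $record = [
        'url'   => $xpath->evaluate("string($heading//a/@href)", $panel),
        'title' => trim($xpath->evaluate("string($heading//h5)", $panel)),
    ];
    // Each <b> holds a label; the node right after it holds the value.
    foreach ($xpath->query('.//div[contains(@class, "panel-body")]//b', $panel) as $b) {
        $label = rtrim(trim($b->textContent), ' :');
        $next  = $b->nextSibling;
        $record[$label] = $next !== null ? trim($next->textContent) : '';
    }
    $records[] = $record;
}

print_r($records);
```

Whitespace and line endings between tags no longer matter here, because the parser handles them. For real pages that are not plain ASCII, loadHTML may also need an encoding hint to read UTF-8 correctly.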