Question

I have a function that I use to help me when I scrape a web page for certain things (links, etc.):

function list_tags($html, $start, $end)
{
    preg_match_all("($start(.*)$end)siU", $html, $matching_data);
    return $matching_data[0];
}

Example usage:

$open_tag  = '<a';
$close_tag = '>';
$links     = list_tags($html, $open_tag, $close_tag);

So print_r($links); results in:

Array
(
    [0] => <a href="blah.html">
    [1] => <a href="other_blah.html">
    Etc...
    Etc...
)

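I can do the same thing with $open_tag = '<script'; or $open_tag = '<div'; and so on, but when I try $open_tag = '<input';, my array comes back completely empty, even though there are multiple <input> tags on the page. Any ideas?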

Edit

The specific page I'm trying to scrape is http://www.pcsoweb.com/inmatebooking/Inquiry.aspx. I used the same function on a page I created myself, and it did find all of the <input ... /> tags I created.

I'll have to dig deeper to find out what is preventing me from grabbing the <input /> tags on this particular site.

I'll also look into the DOMDocument class to see whether it gives better results.

Thanks for the suggestions, doublesharp and feeela. I'll look into this further and see what the real problem is.

Answer 1

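Using a DOM parser is preferable, but if you need to parse the data with a regular expression, try using / as the delimiter instead of ( and ) to make the code more readable, and make your match group lazy with ? (dropping the U modifier):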

function list_tags($html, $start, $end)
{
    // escape forward slashes in your pattern start and end
    $start = str_replace("/", "\/", $start);
    $end   = str_replace("/", "\/", $end);
    preg_match_all("/{$start}(.*?){$end}/si", $html, $matching_data);
    return $matching_data[0];
}

$html = "<input test='test'><a href='asdf'>";
$open_tag  = '<(input|a)';
$close_tag = '>';
$links     = list_tags($html, $open_tag, $close_tag);
print_r($links);

Running this code results in:

Array
(
    [0] => <input test='test'>
    [1] => <a href='asdf'>
)
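For comparison, here is a minimal sketch of the DOM parser route mentioned at the top of this answer. It assumes PHP's bundled DOM extension is available; the helper name list_elements and the sample markup are made up for illustration and are not from the original post.

```php
<?php
// Hypothetical helper (not from the original post): collect every element
// with the given tag name and return each one serialized back to HTML.
function list_elements(string $html, string $tagName): array
{
    $doc = new DOMDocument();

    // Real-world pages are rarely valid HTML, so collect libxml parse
    // warnings internally instead of letting them be printed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = [];
    foreach ($doc->getElementsByTagName($tagName) as $node) {
        // saveHTML() with a node argument serializes just that element.
        $found[] = $doc->saveHTML($node);
    }
    return $found;
}

// Made-up sample markup with one plain and one self-closing <input>.
$html = "<form><input type='text' name='q'><input type='submit' value='Go' /></form>";
foreach (list_elements($html, 'input') as $tag) {
    echo $tag, "\n";
}
```

Because the parser works on elements rather than raw text, it does not matter whether a tag is written as <input ...> or <input ... />. Note that the serialized output may be normalized (quoting, self-closing slash), so it will not necessarily match the source text byte for byte.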

Answer 2

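If I paste the regular expression (<input(.*)>)siU into http://www.functions-online.com/preg_match_all.html with

<a>dfg</a><input type="sdgf"/>

as the input, the <input> tag is found.

One thing to watch for is inputs that end with /> (self-closing). Is there anything in your setup that could be preventing it from being found?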

Without a sample of the HTML, it's hard to say.
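The same check can be reproduced locally instead of in the online tester. This is only a sketch: it hard-codes the pattern that the question's function would build for '<input' and '>', and runs it against the one-line sample above rather than the real page.

```php
<?php
// The pattern list_tags() from the question builds for '<input' and '>'.
// ( and ) act as delimiters; U makes .* lazy, s lets . match newlines,
// and i makes the match case-insensitive.
$pattern = "(<input(.*)>)siU";
$sample  = '<a>dfg</a><input type="sdgf"/>';

$count = preg_match_all($pattern, $sample, $matches);

var_dump($count);       // number of full matches found
print_r($matches[0]);   // the full matches themselves
```

If this reports a match but the real page still returns nothing, a reasonable next step is to save the HTML your script actually downloads and run the same pattern against that file, since what the script receives may differ from what you see in the browser.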

Function not working in all cases
