Question

I want to extract specific data from a page, such as the product title, from its HTML tags.

Here is the div markup from the site:

    <div class="pdct-inf">
    <h2 class="h6" style="min-height:38px;height:38px;">
    <a id="ctl00_cphMain_rPdctG_ctl01_hTitle" href="/whirlpool-whirlpool-direct-drive-285753a-ap3963893.html">Whirlpool Direct Drive Washer Mot...</a></h2><div class="startext">
    <div itemprop="reviewRating" itemscope="" itemtype="http://schema.org/Rating" style="cursor:pointer; float:left; text-align:right;" class="page-style-stars-web-sm rating-5"></div>
    <meta itemprop="worstRating" content="1"><meta itemprop="bestRating" content="5"><meta itemprop="ratingValue" content="5">&nbsp;(<a href="/whirlpool-whirlpool-direct-drive-285753a-ap3963893.html#diy">434</a>)
    </div>
    </div>

I want to get the text inside the <a> tag, which here is: Whirlpool Direct Drive Washer Mot...

Here is my PHP code:

<?php

$html = file_get_contents("http://www.programminghelp.com/");

preg_match_all(
    '/<h2><a href="(.*?)" rel="bookmark" title=".*?">(.*?)<\/a><\/h2>/s',
    $html,
    $posts, // will contain the article data
    PREG_SET_ORDER // formats data into an array of posts
);

foreach ($posts as $post) {
    $link = $post[1];
    $title = $post[2];

    echo $title . "\n";
}

echo "<p>" . count($posts) . " product found</p>\n";

?>

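I need help writing a regular expression for the div content above. The pattern in this call is the part that needs correcting: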

preg_match_all(
        '/<h2><a href="(.*?)" rel="bookmark" title=".*?">(.*?)<\/a><\/h2>/s',

Answer 1

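Perhaps an HTML/XML parser like this one would be more appropriate. (Regular expressions are not well suited to parsing [X]HTML, as noted in the comments.)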
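As a rough illustration of the parser approach, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath. It is not necessarily the parser the answer had in mind. The markup is a trimmed copy of the sample from the question, and the `$products` array and the XPath expression are my own choices for this example.

```php
<?php
// Trimmed copy of the sample markup from the question.
$html = <<<'HTML'
<div class="pdct-inf">
<h2 class="h6" style="min-height:38px;height:38px;">
<a id="ctl00_cphMain_rPdctG_ctl01_hTitle" href="/whirlpool-whirlpool-direct-drive-285753a-ap3963893.html">Whirlpool Direct Drive Washer Mot...</a></h2><div class="startext">
(<a href="/whirlpool-whirlpool-direct-drive-285753a-ap3963893.html#diy">434</a>)
</div>
</div>
HTML;

// Real-world HTML is rarely valid, so collect parser warnings
// internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Assumption: every product title is an <a> that is a direct child of an <h2>
// somewhere inside a div whose class list contains "pdct-inf".
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " pdct-inf ")]//h2/a';

$products = [];
foreach ($xpath->query($query) as $a) {
    $products[] = [
        'link'  => $a->getAttribute('href'),
        'title' => trim($a->textContent),
    ];
}

foreach ($products as $product) {
    echo $product['title'], "\n";
}
echo count($products), " product(s) found\n";
```

On the sample this prints the single title inside the h2 and skips the review-count link, because that link is not a child of an h2. For the live page you would pass the result of file_get_contents() to loadHTML() instead of the inline string.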

Answer 2

If you do want to use a regular expression, you could try something like this:

/<h2.*>\s*<a.* href="(.*)">(.*)<\/a>/m

You can see it working with your example in this php sandbox.
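To show how a pattern of this kind plugs into the preg_match_all loop from the question, here is a hedged sketch. It uses a stricter variant of the pattern (negated character classes and a lazy group instead of the greedy `.*`), which is my own adjustment rather than the regex above, and it has only been reasoned through against the trimmed sample markup, not the live site.

```php
<?php
// Trimmed copy of the sample markup from the question.
$html = <<<'HTML'
<div class="pdct-inf">
<h2 class="h6" style="min-height:38px;height:38px;">
<a id="ctl00_cphMain_rPdctG_ctl01_hTitle" href="/whirlpool-whirlpool-direct-drive-285753a-ap3963893.html">Whirlpool Direct Drive Washer Mot...</a></h2><div class="startext">
(<a href="/whirlpool-whirlpool-direct-drive-285753a-ap3963893.html#diy">434</a>)
</div>
</div>
HTML;

// [^>]* stays inside one tag, [^"]* stays inside one attribute value,
// and (.*?) stops at the first closing </a>.
$pattern = '/<h2[^>]*>\s*<a[^>]*\bhref="([^"]*)"[^>]*>(.*?)<\/a>/s';

preg_match_all($pattern, $html, $posts, PREG_SET_ORDER);

foreach ($posts as $post) {
    $link  = $post[1];       // value of the href attribute
    $title = trim($post[2]); // text between <a ...> and </a>

    echo $title . "\n";
}

echo "<p>" . count($posts) . " product(s) found</p>\n";
```

The pattern requires the `<a>` to follow an `<h2>` opening tag, so the review-count link in the sample is not matched. It will still break if the site changes its markup, for example by nesting another tag between the h2 and the link.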

