Question

Below is my regexp. It works fine when I assign the HTML content to the variable directly, but not when the content comes from file_get_contents().

<?php 
$url = "http://www.apdepot.com/Products/SearchResults.aspx?type=keyword&keyword=6-918873";

$urlcontent = file_get_contents($url);

/*  It works when I assign the HTML content directly, but not with file_get_contents().

 $urlcontent = '<td width="80%" valign="top" align="left">
                          <span id="ContentPlaceHolder1_Repeater1_lblLongDesc_0">*WAS W10224675 M BASKT-WARE WAS W10171734</span> <input type="hidden" value="*WAS W10224675 M BASKT-WARE WAS W10171734" id="ContentPlaceHolder1_Repeater1_hdnP21Desc_0" name="ctl00$ContentPlaceHolder1$Repeater1$ctl01$hdnP21Desc">
                          </td>'; */


preg_match_all('/<span.*id=\"ContentPlaceHolder1_Repeater1_lblLongDesc_0\".*>(.*?)<\/span>/Us', $urlcontent, $name);
        print_r($name);

Expected output:

Array
(
    [0] => Array
        (
            [0] => <span id="ContentPlaceHolder1_Repeater1_lblLongDesc_0">*WAS W10224675 M BASKT-WARE WAS W10171734</span>
        )

    [1] => Array
        (
            [0] => *WAS W10224675 M BASKT-WARE WAS W10171734
        )

)

Update

It also doesn't work for the anchor tag:

$url = "http://www.apdepot.com/Products/SearchResults.aspx?type=keyword&keyword=6-918873";

    $urlcontent = file_get_contents($url);

    $name = '<td valign="top" align="left" class="SearchResultItemHeader">
                    <a class="thickbox" title="Dishwasher Tube/Spray Arm Kit" href="ItemDetailsPopup.aspx?itemcode=WHI%20675808&amp;keepThis=true&amp;TB_iframe=true&amp;height=500&amp;width=640"><b>Dishwasher Tube/Spray Arm Kit</b></a>                          
                      </td>';

    preg_match_all('/<a.*class=\"thickbox\".*title=\"(.*?)\".*href=\"ItemDetailsPopup.aspx\?itemcode.*\">.*<b>(.*)<\/b><\/a>/s', $name, $nameoutput);
    print_r($nameoutput);

Expected output: the text inside the anchor tag:

Dishwasher Tube/Spray Arm Kit

Answer 1

Try:

preg_match_all('/<span id=\"ContentPlaceHolder1_Repeater1_lblLongDesc_0\".*>(.*)<\/span>/Us', $urlcontent, $name);

Output:

Array

    (
        [0] => Array
            (
                [0] => <span id="ContentPlaceHolder1_Repeater1_lblLongDesc_0">*WAS W10224675 M BASKT-WARE WAS W10171734</span>
            )

        [1] => Array
            (
                [0] => *WAS W10224675 M BASKT-WARE WAS W10171734
            )

    )
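A likely reason the pattern from the question misbehaves on a full page: the U modifier inverts greediness, so `(.*?)` becomes greedy and runs to the last `</span>` in the document, while the leading `<span.*id=` can start at an earlier span. The sketch below illustrates this on a small, made-up HTML string (the `$html` sample is an assumption, not the real page):

```php
<?php
// Hypothetical sample markup standing in for the fetched page (assumption).
$html = '<span id="other">first</span>'
      . '<span id="ContentPlaceHolder1_Repeater1_lblLongDesc_0">wanted</span>'
      . '<span id="tail">last</span>';

// Pattern from the question: under /U the "?" makes (.*?) greedy again.
$original = '/<span.*id=\"ContentPlaceHolder1_Repeater1_lblLongDesc_0\".*>(.*?)<\/span>/Us';

// Pattern suggested in this answer: (.*) stays lazy under /U,
// and "<span id=" anchors the match to the right element.
$fixed = '/<span id=\"ContentPlaceHolder1_Repeater1_lblLongDesc_0\".*>(.*)<\/span>/Us';

preg_match($original, $html, $a);
preg_match($fixed, $html, $b);

echo $a[1], "\n"; // captures past the intended closing tag
echo $b[1], "\n"; // captures only the text of the target span
```

With the single-span test string from the question both patterns give the same result, which would explain why the problem only shows up on the downloaded page.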

For data scraping, XPath is the better choice. Take a look at the example below:

$url = "http://www.apdepot.com/Products/SearchResults.aspx?type=keyword&keyword=6-918873";

$urlcontent = file_get_contents($url);

$doc = new DOMDocument();

$doc->loadHTML($urlcontent);

$xpath = new DOMXpath($doc);

$elements = $xpath->query("//span[@id='ContentPlaceHolder1_Repeater1_lblLongDesc_0']")->item(0)->nodeValue;

echo $elements;

//output: *WAS W10224675 M BASKT-WARE WAS W10171734

For more information, see http://php.net/manual/en/class.domdocument.php and http://php.net/manual/en/class.domxpath.php
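The example above assumes that the download succeeds and that the node exists. Real pages often contain markup that makes loadHTML() emit warnings, and item(0) is null when nothing matches. A minimal defensive sketch follows; the helper name `firstNodeText` and the sample markup are hypothetical and not part of the original answer:

```php
<?php
// Hypothetical helper: returns the text of the first node matching
// $query, or null when the input is empty or nothing matches.
function firstNodeText(string $html, string $query): ?string
{
    if ($html === '') {
        return null; // loadHTML() does not accept an empty string
    }

    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return trim($nodes->item(0)->nodeValue);
}

// Made-up fragment standing in for the fetched page (assumption).
$sample = '<div><span id="ContentPlaceHolder1_Repeater1_lblLongDesc_0">'
        . '*WAS W10224675 M BASKT-WARE WAS W10171734</span></div>';

$text = firstNodeText($sample, "//span[@id='ContentPlaceHolder1_Repeater1_lblLongDesc_0']");
$missing = firstNodeText($sample, "//span[@id='does-not-exist']");

var_dump($text, $missing);
```

With a live URL, it is also worth checking that file_get_contents() did not return false before passing the result to the helper.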

Example for the anchor tag and the b tag:

$urlcontent = '<td valign="top" align="left" class="SearchResultItemHeader">
                    <a class="thickbox" title="Dishwasher Tube/Spray Arm Kit" href="ItemDetailsPopup.aspx?itemcode=WHI%20675808&amp;keepThis=true&amp;TB_iframe=true&amp;height=500&amp;width=640"><b>Dishwasher Tube/Spray Arm Kit</b></a>
                      </td>';

$doc = new DOMDocument();

$doc->loadHTML($urlcontent);

$xpath = new DOMXpath($doc);

$elements = $xpath->query("//td[@class='SearchResultItemHeader']/a/b")->item(0)->nodeValue;

echo $elements;

//output: Dishwasher Tube/Spray Arm Kit
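The regexp in the question captures both the title attribute and the bold text. The same pair can be read with XPath by selecting the anchor elements and calling getAttribute(). This is a sketch on a shortened, made-up fragment (the `$html` sample and the `$items` array are assumptions for illustration):

```php
<?php
// Shortened sample fragment modelled on the question's markup (assumption).
$html = '<div class="SearchResultItemHeader">'
      . '<a class="thickbox" title="Dishwasher Tube/Spray Arm Kit" '
      . 'href="ItemDetailsPopup.aspx?itemcode=WHI%20675808">'
      . '<b>Dishwasher Tube/Spray Arm Kit</b></a></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Collect the title attribute and the visible text of every matching anchor.
$items = [];
foreach ($xpath->query("//a[@class='thickbox']") as $anchor) {
    $items[] = [
        'title' => $anchor->getAttribute('title'),
        'text'  => trim($anchor->textContent),
    ];
}

print_r($items);
```

Looping over the node list instead of calling item(0) also covers result pages that list more than one product.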

Answer 2

Change the regexp like this:

preg_match_all('%<span.*id=\"ContentPlaceHolder1_Repeater1_lblLongDesc_0\"(.*)\/span>%', $urlcontent, $desc);

Then you can apply strip_tags() as shown below:

$description = strip_tags($desc[1][0]);

Output:

Array

    (
        [0] => Array
            (
                [0] => <span id="ContentPlaceHolder1_Repeater1_lblLongDesc_0">*WAS W10224675 M BASKT-WARE WAS W10171734</span>
            )

        [1] => Array
            (
                [0] => *WAS W10224675 M BASKT-WARE WAS W10171734
            )

    )
