Question

I want to scrape some web page content.

I have the following code, but it does not work for every page.

$url1 = 'http://www.just-eat.co.uk/restaurants-tomyumgoong/menu';
$url2 = 'http://www.just-eat.co.uk/';

$curl = curl_init($url1);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$page = curl_exec($curl);

if (curl_errno($curl)) // check for execution errors
{
    echo 'Scraper error: ' . curl_error($curl);
    exit;
}
echo $page;
curl_close($curl);

$regex = '/<div class="responsive-header-logo">(.*?)<\/div>/s';
if (preg_match($regex, $page, $list))
    echo $list[0];
else
    print "Not found";

It does not work with $url2, but when I use $url1 it works like a charm.

What can I do to fix this?

Answer 1

Try simplifying the regular expression to:

$regex = '/responsive-header-logo/';

Answer 2

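Try this regular expression: /<div class="responsive-header-logo">([\s\S]*?)<\/div>/.

By default (without the s modifier), the dot matches any character except a newline; [\s\S] matches any character, newlines included.

For testing regular expressions I recommend http://regexr.com/ - this example works: http://regexr.com/3b56u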
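
To make the difference concrete, here is a minimal sketch that runs three variants of the pattern against a multi-line snippet. The $page string is made-up sample markup, not the real just-eat page, and the variable names are only for illustration.

```php
<?php
// Hypothetical sample markup; the div content spans several lines.
$page = "<div class=\"responsive-header-logo\">\n  <a href=\"/\">Logo</a>\n</div>";

// Dot without the s modifier: it cannot cross the newline, so nothing matches here.
$plainDot = preg_match('/<div class="responsive-header-logo">(.*?)<\/div>/', $page, $m1);

// Dot with the s (PCRE_DOTALL) modifier: the dot also matches newlines.
$dotAll = preg_match('/<div class="responsive-header-logo">(.*?)<\/div>/s', $page, $m2);

// [\s\S] matches any character, newline included, with no modifier needed.
$charClass = preg_match('/<div class="responsive-header-logo">([\s\S]*?)<\/div>/', $page, $m3);

// preg_match returns 1 on a match and 0 on no match.
var_dump($plainDot, $dotAll, $charClass);

// Both successful variants capture the same inner content.
var_dump($m2[1] === $m3[1]);
```

Note that the pattern in the question already carries the s modifier, so on input like this it should behave the same as the [\s\S] version. If it still reports "Not found" for one URL, it may be worth checking whether that page contains the div with exactly that class attribute at all.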

Answer 3

First of all, you shouldn't use regex to parse HTML/XML.

Instead, use a library designed for the job, such as DOM or SimpleXML.

Example using DOM:

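$dom = new DOMDocument();
$dom->loadHTML($html); // $html holds the downloaded page source
$finder = new DOMXPath($dom);
$classname = "responsive-header-logo";
// select every element whose class attribute contains the class name
$nodes = $finder->query("//*[contains(@class, '$classname')]");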

Then pass each matched node to $dom->saveHTML() to extract its HTML code.

See: How should I get a div's content like this using dom in php?
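
Put together, the whole approach might look like the sketch below. It assumes the page source is already in a string; the $html value here is invented sample markup, and the libxml_use_internal_errors() call is an addition of mine to keep warnings about imperfect real-world HTML out of the output.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$html = '<html><body><div class="responsive-header-logo"><a href="/">Logo</a></div>'
      . '<p>Other content</p></body></html>';

// Assumption: collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Find every element whose class attribute contains the class name.
$finder = new DOMXPath($dom);
$classname = 'responsive-header-logo';
$nodes = $finder->query("//*[contains(@class, '$classname')]");

// Serialize each matched node back to an HTML string.
$fragments = [];
foreach ($nodes as $node) {
    $fragments[] = $dom->saveHTML($node);
}

if ($fragments) {
    echo $fragments[0], "\n";
} else {
    echo "Not found\n";
}
```

One caveat: contains(@class, ...) is a substring test, so it would also match a class such as responsive-header-logo-small. For a scraper that only needs this one element, that is usually acceptable.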

How to extract specific span content from a web page

3 answers: