Question

I am trying to display the contents of a Wikipedia infobox using Simple HTML DOM Parser, but it is giving me some problems. Here is the code.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head>
<body>
<?php
//The folder where you uploaded simple_html_dom.php
require_once('simple_html_dom.php');

//Wikipedia page to parse
$html = file_get_html('https://en.wikipedia.org/wiki/Burger_King');

foreach ( $html->find ( 'table[class=infobox vcard]' ) as $element ) {

    $cells = $element->find('td');

    $i = 0;

    foreach($cells as $cell) {


        $left[$i] = $cell->plaintext;

        if (!(empty($left[$i]))) {

            $i = $i + 1;

        }

    }


    $cells = $element->find('th');

    $i = 0;

    foreach($cells as $cell) {

        $right[$i] = $cell->plaintext;

        if (!(empty($right[$i]))) {

            $i = $i + 1;

        }

    }


print_r ($right);

echo "<br><br><br>";

print_r ($left);

//If you want to know what kind of industry burger king is
//echo "Burger king is $right[2], $left[2]

}


?>

</body>
</html>

The code works for https://en.wikipedia.org/wiki/Burger_King, but not for other pages such as https://en.wikipedia.org/wiki/United_Kingdom. This is the error message I receive:

Fatal error: Call to a member function find() on a non-object in C:\wamp\www\MyApps\Inbox.php on line 16

Answer 1

I found that the error came from table[class=infobox vcard]: it only retrieved the contents of the infobox table.
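To illustrate why a selector tied to the full class string is fragile, here is a minimal sketch. It uses PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM, and the two HTML fragments are made-up stand-ins for the real pages. The helper name count_tables is hypothetical. It compares an exact match on the class attribute with a match on the single class token "infobox".

```php
<?php
// Made-up, heavily simplified stand-ins for the two Wikipedia infoboxes.
$pages = [
    'company' => '<table class="infobox vcard"><tr><th>Industry</th><td>Restaurants</td></tr></table>',
    'country' => '<table class="infobox geography vcard"><tr><th>Capital</th><td>London</td></tr></table>',
];

/**
 * Count the tables in an HTML fragment that match an XPath expression.
 * (Hypothetical helper, for illustration only.)
 */
function count_tables(string $html, string $xpath): int
{
    $doc = new DOMDocument();
    // Suppress libxml warnings about the fragment not being a full document.
    @$doc->loadHTML($html);
    return (new DOMXPath($doc))->query($xpath)->length;
}

// Exact comparison against the whole class attribute.
$exact = '//table[@class="infobox vcard"]';
// Token match: any table whose class list contains the word "infobox".
$token = '//table[contains(concat(" ", normalize-space(@class), " "), " infobox ")]';

foreach ($pages as $name => $html) {
    printf("%s: exact=%d token=%d\n", $name, count_tables($html, $exact), count_tables($html, $token));
}
```

The exact match finds the company table only, while the token match finds both. Simple HTML DOM also documents a class selector of the form table.infobox, which may give the same effect there; that is not tested here.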

Answer 2

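1: This code does not work for you because you are trying to get the table with class="infobox vcard" (used for companies) on a country page, where the infobox table has class="infobox geography vcard".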

2: That is not the only problem, though, because you will almost certainly run out of memory.

Replace

$html = file_get_html('https://en.wikipedia.org/wiki/United_Kingdom');

with:

$url = 'https://en.wikipedia.org/wiki/United_Kingdom';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);

$html = new simple_html_dom();
$html->load($curl_scraped_page, true, false);

You should then get something like

Fatal error: Out of memory (allocated XXX) (tried to allocate 40 bytes) 
in /simple_html_dom.php on line 1544

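3: Even if you manage to fix the previous problems, you will still have to update your code, because it may not work correctly as written.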
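One possible way to update the extraction step is to pair each header cell with the data cell in the same row, instead of filling two separate arrays whose indexes can drift apart. The sketch below is only an illustration under assumptions: it uses PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM, the function name infobox_pairs is hypothetical, and the sample markup is made up rather than taken from the real page.

```php
<?php
/**
 * Return infobox rows as label => value pairs.
 * Only rows that contain both a <th> and a <td> are kept.
 * (Hypothetical helper, for illustration only.)
 */
function infobox_pairs(string $html): array
{
    $doc = new DOMDocument();
    // Suppress libxml warnings caused by real-world or partial markup.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $pairs = [];
    // Match any table whose class list contains the token "infobox".
    $rows = $xpath->query(
        '//table[contains(concat(" ", normalize-space(@class), " "), " infobox ")]//tr'
    );
    foreach ($rows as $row) {
        $th = $xpath->query('./th', $row)->item(0);
        $td = $xpath->query('./td', $row)->item(0);
        // Skip title rows (th only) and image rows (td only).
        if ($th !== null && $td !== null) {
            $pairs[trim($th->textContent)] = trim($td->textContent);
        }
    }
    return $pairs;
}

// Made-up sample markup, ASCII only.
$sample = '<table class="infobox geography vcard">'
    . '<tr><th colspan="2">United Kingdom</th></tr>'
    . '<tr><th>Capital</th><td>London</td></tr>'
    . '<tr><td colspan="2">Map image</td></tr>'
    . '<tr><th>Currency</th><td>Pound sterling</td></tr>'
    . '</table>';

print_r(infobox_pairs($sample));
```

For the sample above this yields the pairs Capital => London and Currency => Pound sterling. Real infoboxes have nested tables, footnote markers, and non-ASCII text, so the sketch would need adjusting before being used on a live page.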

Edit 1:

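My favorite way to avoid this problem is to use the Google cache, which offers a "text-only" version of the page. This usually avoids having to hold a large amount of data in memory, which is one reason the code fails. The main drawback is that the Google cache does not know what to do with th elements, so their contents simply disappear.

I will look for alternatives; in the meantime, here is the code XD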

<?php

require_once('simple_html_dom.php');
//$html = file_get_html('https://en.wikipedia.org/wiki/United_Kingdom');

    //q = website to fetch, leave "cache:"
    $url = 'http://webcache.googleusercontent.com/search?strip=1&q=cache:en.wikipedia.org/wiki/United_Kingdom';

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $curl_scraped_page = curl_exec($ch);

    $html = new simple_html_dom();
    $html->load($curl_scraped_page, true, false);


//echo $html;


foreach ( $html->find ( 'table[class=infobox geography vcard]' ) as $element ) {


    $cells = $element->find('td');

    $i = 0;

    foreach($cells as $cell) {


        $left[$i] = $cell->plaintext;

        if (!(empty($left[$i]))) {

            $i = $i + 1;

        }

    }


print_r ($left);

}


?>

If I helped you (and I am sure I did), please mark this as the best answer and give it a thumbs up :P

Get the contents of a wiki infobox using Simple Dom Parser
