Question

I am using the following code, and it successfully collects data from one specific page:

    include 'simplehtmldom/simple_html_dom.php';

    $html = file_get_html('http://test.com/file/1209i0329/');

    // Find all article blocks
    foreach($html->find('div.Content') as $file) {
        $item['date']     = $file->find('id.article-date', 0)->plaintext;
        $item['location']    = $file->find('id.article-location', 0)->plaintext;
        $item['price'] = $file->find('div.article', 0)->plaintext;
        $files[] = $item;
    }

    print_r($files);

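The code works for http://test.com/file/1209i0329.php, but my goal is to collect data from every page on this domain whose URL starts with http://test.com/file/ (for example http://test.com/file/1209i0329/, http://test.com/file/120dnkj329/, and so on). Is there a way to do this with simple_html_dom?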

Answer 1

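I don't know where the pages you want to scrape live (on the same domain or elsewhere), but you will probably need to loop over an array containing the URLs you want to process.

Consider this example: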

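    include 'simplehtmldom/simple_html_dom.php';

    // Fetching several pages one after another will most likely take some time.

    $files = array();
    $urls = array(
        'http://test.com/file/1209i0329/',
        'http://test.com/file/120dnkj329/',
        'http://en.wikipedia.org/wiki/',
    );

    foreach ($urls as $url) {

        $html = file_get_html($url);
        if (!$html) {
            continue; // skip pages that could not be loaded
        }

        // Find all article blocks.
        // Selectors are copied from the question; adjust them to the real page markup.
        foreach ($html->find('div.Content') as $file) {
            $item = array(); // start fresh so values do not leak between blocks
            $item['date']     = $file->find('id.article-date', 0)->plaintext;
            $item['location'] = $file->find('id.article-location', 0)->plaintext;
            $item['price']    = $file->find('div.article', 0)->plaintext;
            $files[] = $item;
        }

        $html->clear(); // free memory before loading the next page
    }

    print_r($files);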
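
The URL list above is hard-coded. If some page on the site links to the individual file pages, the list could instead be built by keeping only the links that start with http://test.com/file/. The following is a minimal sketch of that filtering step, not part of the original answer; `filter_urls_by_prefix` is a hypothetical helper name, and the sample hrefs are made up for illustration.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): keep only the
// links that start with $prefix, dropping duplicates.
function filter_urls_by_prefix(array $hrefs, string $prefix): array
{
    $matches = array_filter($hrefs, function ($href) use ($prefix) {
        // strncmp() returns 0 when the first strlen($prefix) bytes match
        return strncmp($href, $prefix, strlen($prefix)) === 0;
    });

    // Remove repeated links and reindex the array from 0
    return array_values(array_unique($matches));
}

// Made-up sample of links, as they might be collected from a listing page
$hrefs = array(
    'http://test.com/file/1209i0329/',
    'http://test.com/about/',
    'http://test.com/file/120dnkj329/',
    'http://test.com/file/1209i0329/',
);

$urls = filter_urls_by_prefix($hrefs, 'http://test.com/file/');
print_r($urls);
```

With Simple HTML DOM, the `$hrefs` array could be filled by looping over `$html->find('a')` on a listing page and reading each element's `href` attribute. Whether such a listing page exists on the target site is an assumption, and relative links would need to be turned into absolute URLs before filtering.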

Collecting web data from multiple pages with Simple HTML DOM
