Question

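I'm using Goutte to build a web scraper.

For development, I've saved an .html document that I'd like to traverse (so I'm not constantly making requests to the website). Here's what I have so far: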

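use Goutte\Client;

$client = new Client();
$html = file_get_contents('test.html');
$crawler = $client->request(null, null, [], [], [], $html);

From what I can tell, this should call request() in Symfony\Component\BrowserKit and pass in the raw body data. This is the error message I'm getting:

[error message missing]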

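If I were just using DomCrawler, creating a crawler from a string would be straightforward (see: http://symfony.com/doc/current/components/dom_crawler.html). I'm just not sure how to do the equivalent with Goutte.

Thanks in advance.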

Answer 1

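The tool you've decided to use makes real HTTP connections and isn't suited to what you want to do, at least not out of the box.

Option 1: Implement your own BrowserKit client

All Goutte does is extend BrowserKit's Client. It implements HTTP requests with Guzzle.

To implement your own client, all you need to do is extend Symfony\Component\BrowserKit\Client and provide the doRequest() method: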

use Symfony\Component\BrowserKit\Client;
use Symfony\Component\BrowserKit\Request;
use Symfony\Component\BrowserKit\Response;

class FilesystemClient extends Client
{
    /**
     * @param object $request An origin request instance
     *
     * @return object An origin response instance
     */
    protected function doRequest($request)
    {
        $file = $this->getFilePath($request->getUri());

        if (!file_exists($file)) {
            return new Response('Page not found', 404, []);
        }

        $content = file_get_contents($file);

        return new Response($content, 200, []);
    }

    private function getFilePath($uri)
    {
        // convert a URI to the file path of your saved response
        // could be something like this:
        return preg_replace('#[^a-zA-Z_\-\.]#', '_', $uri).'.html';
    }
}

$client = new FilesystemClient();
$client->request('GET', '/test');

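The client's request() expects real URIs, so you need to implement your own logic to convert them to filesystem locations.

Have a look at Goutte's Client for inspiration.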
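
To make the URI-to-file mapping concrete, here is a minimal standalone sketch that applies the same regex as getFilePath() outside the client. The function name uriToFilePath is hypothetical. It assumes that doRequest() receives an absolute URI; as far as I know, BrowserKit resolves a relative '/test' against http://localhost by default, so the saved file would need to be named accordingly.

```php
<?php

// Hypothetical helper mirroring getFilePath(): every character that is not
// an ASCII letter, underscore, hyphen or dot is replaced by an underscore.
function uriToFilePath(string $uri): string
{
    return preg_replace('#[^a-zA-Z_\-\.]#', '_', $uri) . '.html';
}

echo uriToFilePath('http://localhost/test'), PHP_EOL;  // http___localhost_test.html
echo uriToFilePath('http://localhost/page2'), PHP_EOL; // http___localhost_page_.html
```

Note that digits are replaced as well, so /page1 and /page2 would map to the same file. Adding 0-9 to the character class is one way to avoid that.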

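Option 2: Implement a custom Guzzle handler

Since Goutte uses Guzzle, you could provide your own Guzzle handler that loads responses from files instead of making real HTTP requests. Have a look at the handlers and middleware doc.

If you're only after caching responses in order to make fewer HTTP requests, Guzzle already provides support for that.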
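
As a rough sketch of this option, the snippet below uses Guzzle's built-in MockHandler rather than a hand-written handler, which is a simplification on my part. It assumes a Goutte 3.x release (the Guzzle-based one, whose client has setClient()), Composer autoloading, and the test.html file from the question. The URL is arbitrary, since no connection is made.

```php
<?php
require 'vendor/autoload.php';

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;
use GuzzleHttp\Handler\MockHandler;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Psr7\Response;

// Queue one canned response built from the saved page.
$mock = new MockHandler([
    new Response(200, ['Content-Type' => 'text/html'], file_get_contents('test.html')),
]);

// A Guzzle client whose handler stack ends in the mock, not in cURL.
$guzzle = new GuzzleClient(['handler' => HandlerStack::create($mock)]);

$client = new Client();
$client->setClient($guzzle); // assumption: Goutte 3.x API

// Served from the queue; the URL is never contacted.
$crawler = $client->request('GET', 'http://localhost/test');
```

Each queued response is consumed by one request, so a second request() call would need a second entry in the queue. A custom handler that maps the request URI to a file would remove that limit.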

Option 3: Use DomCrawler directly

new Crawler(file_get_contents('test.html'))

The only downside is that you'll lose some of the convenience methods of the BrowserKit client, such as click() or selectLink().
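
For illustration, here is a slightly fuller sketch of this option, assuming symfony/dom-crawler is installed through Composer. The inline HTML is a stand-in for the saved file, and the base URI is an assumption needed only to resolve relative links. As far as I know, selectLink() is defined on the Crawler itself, so it remains usable; what is lost is client-side navigation such as click().

```php
<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Stand-in for file_get_contents('test.html').
$html = '<!DOCTYPE html><html><head><title>Test page</title></head>'
      . '<body><a href="/page2">Next</a></body></html>';

// The second argument is the URI the document is treated as coming from;
// relative links are resolved against it.
$crawler = new Crawler($html, 'http://localhost/test');

$title   = $crawler->filterXPath('//title')->text();
$nextUri = $crawler->selectLink('Next')->link()->getUri();

echo $title, PHP_EOL;   // Test page
echo $nextUri, PHP_EOL; // http://localhost/page2
```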
