Question

In my localhost document root:

crawl.html

<html>
<body>
<p>
<form action="welcome.php" method="get">
Site to crawl: <input type="text" name="crawlThis">
<input type="submit">
</form>
</p>

</body>
</html>

welcome.php

 <html>
 <body>

 <?php 
 include ("crawler.php");

 echo $crawl = new Crawler($_GET["crawlThis"]);

 $images = $crawl->get("images");

 $links = $crawl->get("links"); 

 echo $links;
 echo $images;

 ?>
 <br>

</body>
</html>

and crawler.php

<?php

class Crawler {

    protected $markup = '';

    public function __construct($uri) {
        $this->markup = $this->getMarkup($uri);
    }

    public function getMarkup($uri) {
        return file_get_contents($uri);
    }

    public function get($type) {
        $method = "_get_{$type}";
        if (method_exists($this, $method)){
            return call_user_method($method, $this);
        }
    }

    protected function _get_images() {
        if (!empty($this->markup)){
            preg_match_all('/<img([^>]+)\/>/i', $this->markup, $images);
            return !empty($images[1]) ? $images[1] : FALSE;
        }
    }

    protected function _get_links() {
        if (!empty($this->markup)){
            preg_match_all('/<a([^>]+)\>(.*?)\<\/a\>/i', $this->markup, $links);
            return !empty($links[1]) ? $links[1] : FALSE;
        }
    }

}


/*$crawl = new Crawler($);

$images = $crawl->get('images');

$links = $crawl->get('links');*/

?>

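The resulting page is just blank. I can't tell whether I'm simply failing to echo $images or whether my logic is wrong. I expect a list of images followed by a list of links.

Also, do I have to include crawler.php, or will PHP search the containing directory for a class of the same name?

Sorry, coming to PHP from Java takes a bit of a mental adjustment.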

Answer 1

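You are using some kind of curly (typographic) quote characters, such as ” and ‘

These are not valid quote characters in PHP. You need to use regular straight quotes, such as " and '

Also, before you think about writing any more code, you should configure PHP to show you errors and notices.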
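
As a minimal sketch of that last point, assuming a development machine where it is acceptable to print errors to the page, the lines below could go at the very top of welcome.php. They only affect errors raised at runtime; a parse error in the same file is hit before these lines run, so for those `display_errors = On` would have to be set in php.ini instead.

```php
<?php
// Development-only settings: report every error, warning and notice,
// and print them in the page instead of leaving it blank.
error_reporting(E_ALL);
ini_set('display_errors', '1');
ini_set('display_startup_errors', '1');
```

These settings should be turned off again on a public server, since error messages can leak paths and other details.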

Answer 2

I wrote my own, but there are plenty of documented examples of how to do this. Here is a good one that you can follow or use:

crawler example
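
For illustration only, here is a minimal sketch of pulling image and link URLs out of markup. It swaps the question's regular expressions for PHP's built-in DOMDocument, which is a choice made for this sketch and not necessarily what the linked example does; `extractAttributes` is a hypothetical helper name and the sample markup is made up. Two details of the posted code are worth keeping in mind when adapting it: `call_user_method()` was removed in PHP 7 (`call_user_func([$this, $method])` is the replacement), and echoing an array only prints `Array`, so results need to be looped over.

```php
<?php
// Hypothetical helper, not part of the Crawler class from the question.
// Returns the value of $attribute for every <$tag> element found in $markup.
function extractAttributes(string $markup, string $tag, string $attribute): array
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $values = [];
    foreach ($dom->getElementsByTagName($tag) as $node) {
        if ($node->hasAttribute($attribute)) {
            $values[] = $node->getAttribute($attribute);
        }
    }
    return $values;
}

// Made-up sample markup; in the question this would come from file_get_contents($uri).
$markup = '<html><body><a href="/about">About</a><img src="logo.png"></body></html>';

// Arrays cannot be echoed directly, so print one entry per line.
foreach (extractAttributes($markup, 'img', 'src') as $src) {
    echo htmlspecialchars($src), "<br>\n";
}
foreach (extractAttributes($markup, 'a', 'href') as $href) {
    echo htmlspecialchars($href), "<br>\n";
}
```

A DOM parser tends to cope better than a regex with attributes in any order and with `<img>` tags that are not self-closed, which the pattern in the question would miss.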
