I'm using Scrapy to extract data from some websites. The problem is that my spider only crawls the pages in the initial start_urls; it does not follow the URLs found inside those pages. I've copied the spider here verbatim:
from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc

from nextlink.items import NextlinkItem


class Nextlink_Spider(BaseSpider):
    name = "Nextlink"
    allowed_domains = ["Nextlink"]
    start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//body/div[2]/div[3]/div/ul/li[2]/a/@href')

        for site in sites:
            relative_url = site.extract()
            url = self._urljoin(response, relative_url)
            yield Request(url, callback=self.parsetext)

    def parsetext(self, response):
        log = open("log.txt", "a")
        log.write("test if the parsetext is called")
        hxs = HtmlXPathSelector(response)
        items = []
        texts = hxs.select('//div').extract()
        for text in texts:
            item = NextlinkItem()
            item['text'] = text
            items.append(item)
            log = open("log.txt", "a")
            log.write(text)
        return items

    def _urljoin(self, response, url):
        """Helper to convert relative urls to absolute"""
        return urljoin_rfc(response.url, url, response.encoding)
I use log.txt to check whether parsetext is ever called. However, after running my spider, log.txt stays empty.
Answer 0 (score: 1)
See here:
allowed_domains
An optional list of strings containing the domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list won't be followed if OffsiteMiddleware is enabled.
So, as long as OffsiteMiddleware is not activated in your settings, it doesn't matter, and you can leave allowed_domains out entirely.
Check whether OffsiteMiddleware is activated in your settings.py. It should not be activated if you want to allow your spider to crawl on any domain.
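If you do want to switch the filter off explicitly, a minimal sketch for settings.py could look like this (the middleware path below matches the old scrapy.contrib layout used in the question; adjust it to your Scrapy version):

# Sketch of a settings.py entry: setting the middleware to None disables it,
# so requests outside allowed_domains are no longer dropped.
SPIDER_MIDDLEWARES = {
    'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None,
}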
Answer 1 (score: 1)
I think the problem is that you're not telling Scrapy to follow each crawled URL. For my own blog, I've implemented a CrawlSpider that uses LinkExtractor-based rules to extract all relevant links from my blog pages:
# -*- coding: utf-8 -*-
'''
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program. If not, see <http://www.gnu.org/licenses/>.
 *
 * @author Marcel Lange <info@ask-sheldon.com>
 * @package ScrapyCrawler
'''
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

import Crawler.settings
from Crawler.items import PageCrawlerItem


class SheldonSpider(CrawlSpider):
    name = Crawler.settings.CRAWLER_NAME
    allowed_domains = Crawler.settings.CRAWLER_DOMAINS
    start_urls = Crawler.settings.CRAWLER_START_URLS
    rules = (
        Rule(
            LinkExtractor(
                allow_domains=Crawler.settings.CRAWLER_DOMAINS,
                allow=Crawler.settings.CRAWLER_ALLOW_REGEX,
                deny=Crawler.settings.CRAWLER_DENY_REGEX,
                restrict_css=Crawler.settings.CSS_SELECTORS,
                canonicalize=True,
                unique=True
            ),
            follow=True,
            callback='parse_item',
            process_links='filter_links'
        ),
    )

    # Filter links with the nofollow attribute
    def filter_links(self, links):
        return_links = list()
        if links:
            for link in links:
                if not link.nofollow:
                    return_links.append(link)
                else:
                    self.logger.debug('Dropped link %s because nofollow attribute was set.' % link.url)
        return return_links

    def parse_item(self, response):
        # self.logger.info('Parsed URL: %s with STATUS %s', response.url, response.status)
        item = PageCrawlerItem()
        item['status'] = response.status
        item['title'] = response.xpath('//title/text()')[0].extract()
        item['url'] = response.url
        item['headers'] = response.headers
        return item
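For reference, the spider reads its configuration from a Crawler/settings.py module. The setting names below come from the code above, but the values are purely illustrative and not taken from the linked post:

# Illustrative Crawler/settings.py values; only the names are from the spider
# above, the values are made up for this sketch.
CRAWLER_NAME = 'sheldon'
CRAWLER_DOMAINS = ['ask-sheldon.com']
CRAWLER_START_URLS = ['https://www.ask-sheldon.com/']
CRAWLER_ALLOW_REGEX = ()                               # empty = allow everything
CRAWLER_DENY_REGEX = (r'/wp-admin/', r'\?replytocom=')
CSS_SELECTORS = ('#content', '.main-navigation')       # regions to extract links from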
At https://www.ask-sheldon.com/build-a-website-crawler-using-scrapy-framework/ I've described in detail how I implemented a website crawler to warm up my WordPress full-page cache.
Answer 2 (score: 0)
My guess is this line:
allowed_domains = ["Nextlink"]
That isn't a domain like domain.tld, so every link will be rejected.
Try the value from the documentation example instead: allowed_domains = ["dmoz.org"]
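Applied to the spider from the question, that would look roughly like this (untested sketch, only allowed_domains changes):

class Nextlink_Spider(BaseSpider):
    name = "Nextlink"
    # Use a real domain so links on dmoz.org are no longer filtered out
    allowed_domains = ["dmoz.org"]
    start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]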