I'm trying to build a broad, continuous crawler. I can extract links, but I can't get the spider to crawl those links and extract the links they contain. The end goal of the project is to crawl .au domains and add their root URLs to a database.
class Crawler(scrapy.Spider):
    name = "crawler"
    rules = (Rule(LinkExtractor(allow='.com'), callback='parse_item'))
    # This will be changed to allow .au before deployment to only crawl .au sites.
    start_urls = [
        "http://quotes.toscrape.com/",
    ]

    def parse(self, response):
        urls = response.xpath("//a/@href")
        for u in urls:
            l = ItemLoader(item=Link(), response=response)
            l.add_xpath('url', './/a/@href')
        return l.load_item()
Another problem I'm running into is that for internal links it stores a relative URL path instead of an absolute one. I've tried to fix it in this section.
urls = response.xpath("//a/@href")
for u in urls:
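(For reference, a minimal sketch of how the relative hrefs could be resolved with response.urljoin; the yielded dict here is just an illustration of the idea, not my working spider:)

def parse(self, response):
    # Get plain href strings instead of Selector objects
    for href in response.xpath("//a/@href").getall():
        # urljoin resolves relative paths like "/page/2/" against response.url
        yield {'url': response.urljoin(href)}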
The items.py file:
class Link(scrapy.Item):
    url = scrapy.Field()
    pass
Answer 0 (score: 1)
I managed to figure it out. I'm posting the basic code below to help anyone who runs into the same problem later.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# Create a list of sites not to crawl.
# Best to read this from a file containing the top 100 sites, for example.
denylist = [
    'google.com',
    'yahoo.com',
    'youtube.com'
]

class Crawler(CrawlSpider):  # For a broad crawl you need to use "CrawlSpider"
    name = "crawler"
    rules = (Rule(LinkExtractor(allow=('.com', ), deny=(denylist)),
                  follow=True, callback='parse_item'),)
    start_urls = [
        "http://quotes.toscrape.com",
    ]

    def parse_item(self, response):
        # self.logger.info('LOGGER %s', response.url)
        # Use the line above to log and see info in the terminal.
        yield {
            'link': response.url
        }
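To close the loop on the original .au goal, here is a rough sketch of how the rule might look once the allow pattern is switched over; the regex, the file-based denylist, and the urlparse step are my own assumptions rather than tested code:

from urllib.parse import urlparse

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# Hypothetical: read the denylist from a file instead of hard-coding it.
with open('denylist.txt') as f:
    denylist = [line.strip() for line in f if line.strip()]

class AuCrawler(CrawlSpider):
    name = "au_crawler"
    # Assumed regex: only follow links whose host ends in .au
    rules = (
        Rule(LinkExtractor(allow=(r'^https?://[^/]+\.au(/|$)', ), deny=denylist),
             follow=True, callback='parse_item'),
    )
    start_urls = ["http://quotes.toscrape.com"]

    def parse_item(self, response):
        # Reduce each crawled page to its root URL before storing it.
        parts = urlparse(response.url)
        yield {'root_url': f'{parts.scheme}://{parts.netloc}'}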
Answer 1 (score: 0)
ItemLoaders are useful for creating item objects that need Processors. From your code I don't see a need for them; you can simply yield Request objects. You can also get rid of the Link class (note: pass is meant as a placeholder when there is nothing else in a block, so the pass in your code doesn't make sense).
def parse(self, response):
    urls = response.xpath("//a/@href").getall()
    for u in urls:
        # Request needs an absolute URL string, so resolve relative paths first
        yield scrapy.Request(response.urljoin(u), callback=self.your_callback_method)
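If you're on a recent Scrapy version, response.follow may be a slightly more forgiving alternative here, since as far as I know it accepts relative URLs and href selectors directly:

def parse(self, response):
    for href in response.xpath("//a/@href"):
        # follow() handles relative URLs and attribute selectors for us
        yield response.follow(href, callback=self.your_callback_method)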
Hope this helps.