Question

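Here is my code. The spider does not crawl the URLs, or does not extract them, or something along those lines. If I put a target URL in "start_urls", Scrapy finds the item but does not crawl any further; if I put the URL that contains the list of targets in "start_urls", I get 0 results. :) I hope the text isn't confusing.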

from scrapy.spiders import Spider
from testing.items import TestingItem
import scrapy

class MySpider(scrapy.Spider):
    name = 'testing'
    allowed_domains = ['http://somewebsite.com']
    start_urls = ['http://somewebsite.com/listings.php']

    def parse(self, response):
        for href in response.xpath('//h5/a/@href'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_item)

    def parse_item(self, response):
        titles = response.xpath('//*[@class="panel-content user-info"]').extract()
        for title in titles:
            item = TestingItem()
            item["nimi"] = response.xpath('//*[@class="seller-info"]/h3/text()').extract()

            yield item

Answer 1

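You need to remove the http:// from allowed_domains. It expects bare domain names such as somewebsite.com, not URLs; with the scheme included, the links yielded from parse are filtered out as offsite, so only the start URL gets fetched.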
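To illustrate why the scheme matters, here is a minimal sketch of a host-name check. It is a simplified stand-in written for this explanation, not Scrapy's actual offsite filter; the function name `is_allowed` and the item link are assumptions made up for the example.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Simplified stand-in for an offsite check (not Scrapy's own code):
    compare the URL's host name with each allowed domain or its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# Hypothetical link of the kind parse() would yield.
link = "http://somewebsite.com/item.php?id=1"

# The host name is "somewebsite.com", which never equals "http://somewebsite.com".
print(is_allowed(link, ["http://somewebsite.com"]))  # False

# With the bare domain, the same link passes.
print(is_allowed(link, ["somewebsite.com"]))  # True
```

In the spider itself, the corresponding change would be allowed_domains = ['somewebsite.com'].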

To answer your comment: for pagination you can use Rules. I'll let you check the documentation for the details. They make it easy to follow pagination links.

A small example:

rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('xpath/to/nextpage/button',)), callback="parse", follow=True),)

Hope this helps.

Scrapy only crawls one page
