I created a spider with Scrapy and wrote scripts to crawl many pages.
Unfortunately, not all of the scripts scrape every page. Some runs return all pages, while others return only 23 or 180 results (the count differs per URL).
import scrapy


class BotCrawl(scrapy.Spider):
    name = "crawl-bl2"

    start_urls = [
        'http://www.bukalapak.com/c/perawatan-kecantikan/perawatan-wajah?page=1&search%5Bsort_by%5D=last_relist_at%3Adesc&utf8=%E2%9C%93',
    ]

    def parse(self, response):
        # extract every product in the listing grid
        for grid in response.css("ul[class='products row-grid']"):
            for product in grid.css('li'):
                yield {
                    'judul': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::text').extract(),
                    'penjual': product.css('h5[class=user__name] a::attr(href)').extract(),
                    'link': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::attr(href)').extract(),
                    'kota': product.css('div[class=user-city] a::text').extract(),
                    'harga': product.css('div[class=product-price]::attr(data-reduced-price)').extract()
                }

        # follow the "next page" link until there is none
        next_page_url = response.css("div.pagination > a[class=next_page]::attr(href)").extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))
Is it blocking the HTTP requests, or might there be a bug in my code?
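One way to separate the two causes is to stop Scrapy from silently discarding non-200 responses and log them instead; a minimal sketch (the status codes are an assumption about what a blocking site might return):

    import scrapy


    class BotCrawl(scrapy.Spider):
        name = "crawl-bl2"
        # let parse() see throttled/blocked responses instead of Scrapy dropping them
        handle_httpstatus_list = [403, 429, 503]  # assumed codes; adjust to what the site actually sends

        def parse(self, response):
            if response.status != 200:
                # a non-200 here points to blocking/throttling rather than a selector bug
                self.logger.warning("got %s for %s", response.status, response.url)
                return
            # ... normal parsing continues here ...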
Edit: the code below, updated after Granitosaurus's answer, still has errors.
import scrapy


class BotCrawl(scrapy.Spider):
    name = "crawl-bl2"

    start_urls = [
        'http://www.bukalapak.com/c/perawatan-kecantikan/perawatan-wajah?page=1&search%5Bsort_by%5D=last_relist_at%3Adesc&utf8=%E2%9C%93',
    ]

    def parse(self, response):
        products = response.css('article.product-display')
        for product in products:
            yield {
                'judul': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::text').extract(),
                'penjual': product.css('h5[class=user__name] a::attr(href)').extract(),
                'link': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::attr(href)').extract(),
                'kota': product.css('div[class=user-city] a::text').extract(),
                'harga': product.css('div[class=product-price]::attr(data-reduced-price)').extract()
            }

        # next page
        next_page_url = response.css("div.pagination > a[class=next_page]::attr(href)").extract_first()
        last_url = "/c/perawatan-kecantikan/perawatan-wajah?page=100&search%5Bsort_by%5D=last_relist_at%3Adesc&utf8=%E2%9C%93"
        if next_page_url is not last_url:
            yield scrapy.Request(response.urljoin(next_page_url), dont_filter=True)
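Note that one bug is still hiding in the pagination above: `is not` compares object identity, not string contents, so `next_page_url is not last_url` is effectively always true. Worse, on the final page `next_page_url` is `None`, `urljoin` then falls back to the current URL, and with `dont_filter=True` the spider can re-request the last page indefinitely. A minimal sketch of a safer stopping condition, keeping the page-100 cutoff from the code above:

        # stop when there is no next link, or when the page-100 cutoff is reached
        next_page_url = response.css("div.pagination > a[class=next_page]::attr(href)").extract_first()
        last_url = "/c/perawatan-kecantikan/perawatan-wajah?page=100&search%5Bsort_by%5D=last_relist_at%3Adesc&utf8=%E2%9C%93"
        if next_page_url is not None and next_page_url != last_url:  # != compares values, `is not` compares identity
            yield scrapy.Request(response.urljoin(next_page_url), dont_filter=True)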
Thanks
Answer 0 (score: 1)
Your product xpath is a bit unreliable. Try selecting the product articles directly instead; the website makes this easy to do with css selectors:
products = response.css('article.product-display')
for product in products:
    yield {
        'judul': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::text').extract(),
        'penjual': product.css('h5[class=user__name] a::attr(href)').extract(),
        'link': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::attr(href)').extract(),
        'kota': product.css('div[class=user-city] a::text').extract(),
        'harga': product.css('div[class=product-price]::attr(data-reduced-price)').extract()
    }
You can debug the response by inserting inspect_response:
def parse(self, response):
    products = response.css('article.product-display')
    if not products:
        from scrapy.shell import inspect_response
        inspect_response(response, self)
        # will open up python shell here where you can check `response` object
        # try `view(response)` to open it up in your browser and such.
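If the shell shows a captcha or an error page instead of product markup, the missing results are more likely throttling than a selector problem. A minimal sketch of politeness settings that may help, with assumed values that are not tuned to this particular site:

    # settings.py - slow the crawl so the server is less likely to cut it off
    DOWNLOAD_DELAY = 1             # assumed value; seconds between requests to the same domain
    AUTOTHROTTLE_ENABLED = True    # let Scrapy adapt the delay to observed latencies
    RETRY_ENABLED = True
    RETRY_HTTP_CODES = [429, 503]  # retry the codes a throttling server typically returns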