I want to scrape this website. I wrote a spider, but it only crawls the front page, i.e. the first 52 items.
I have tried this code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from aqaq.items import aqaqItem
import os
import urlparse
import ast

a = []


class aqaqspider(BaseSpider):
    name = "jabong"
    allowed_domains = ["jabong.com"]
    start_urls = [
        "http://www.jabong.com/women/clothing/womens-tops/",
    ]

    def parse(self, response):
        # ... Extract items in the page using extractors
        n = 3
        ct = 1
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="page"]')
        for site in sites:
            name = site.select('//div[@id="content"]/div[@class="l-pageWrapper"]/div[@class="l-main"]/div[@class="box box-bgcolor"]/section[@class="box-bd pan mtm"]/ul[@id="productsCatalog"]/li/a/@href').extract()
            print name
            print ct
            ct = ct + 1
            a.append(name)
        req = Request(url="http://www.jabong.com/women/clothing/womens-tops/?page=" + str(n),
                      headers={"Referer": "http://www.jabong.com/women/clothing/womens-tops/",
                               "X-Requested-With": "XMLHttpRequest"},
                      callback=self.parse, dont_filter=True)
        return req  # and your items
It shows the following output:
2013-10-31 09:22:42-0500 [jabong] DEBUG: Crawled (200) <GET http://www.jabong.com/women/clothing/womens-tops/?page=3> (referer: http://www.jabong.com/women/clothing/womens-tops/)
2013-10-31 09:22:42-0500 [jabong] DEBUG: Filtered duplicate request: <GET http://www.jabong.com/women/clothing/womens-tops/?page=3> - no more duplicates will be shown (see DUPEFILTER_CLASS)
2013-10-31 09:22:42-0500 [jabong] INFO: Closing spider (finished)
2013-10-31 09:22:42-0500 [jabong] INFO: Dumping Scrapy stats:
When I put dont_filter=True, it never stops.
Answer 0 (score: 5)
Yes, dont_filter has to be used here, because each time you scroll the page down to the bottom, only the page GET parameter changes in the XHR request to http://www.jabong.com/women/clothing/womens-tops/?page=X.

Now you need to figure out how to stop crawling. That is actually simple - just check when there are no products on the next page and raise a CloseSpider exception.

Here is a complete code sample that works for me (it stops at page 234):
import scrapy
from scrapy.exceptions import CloseSpider
from scrapy.spider import BaseSpider
from scrapy.http import Request


class Product(scrapy.Item):
    brand = scrapy.Field()
    title = scrapy.Field()


class aqaqspider(BaseSpider):
    name = "jabong"
    allowed_domains = ["jabong.com"]
    start_urls = [
        "http://www.jabong.com/women/clothing/womens-tops/?page=1",
    ]
    page = 1

    def parse(self, response):
        products = response.xpath("//li[@data-url]")

        if not products:
            raise CloseSpider("No more products!")

        for product in products:
            item = Product()
            item['brand'] = product.xpath(".//span[contains(@class, 'qa-brandName')]/text()").extract()[0].strip()
            item['title'] = product.xpath(".//span[contains(@class, 'qa-brandTitle')]/text()").extract()[0].strip()
            yield item

        self.page += 1
        yield Request(url="http://www.jabong.com/women/clothing/womens-tops/?page=%d" % self.page,
                      headers={"Referer": "http://www.jabong.com/women/clothing/womens-tops/",
                               "X-Requested-With": "XMLHttpRequest"},
                      callback=self.parse,
                      dont_filter=True)
Answer 1 (score: 2)
You can try this code, which is slightly different from alecxe's: if there are no products, simply return from the function, which ultimately leads to closing the spider. A simple solution.
import scrapy
from scrapy.exceptions import CloseSpider
from scrapy.spider import Spider
from scrapy.http import Request


class aqaqItem(scrapy.Item):
    brand = scrapy.Field()
    title = scrapy.Field()


class aqaqspider(Spider):
    name = "jabong"
    allowed_domains = ["jabong.com"]
    start_urls = ["http://www.jabong.com/women/clothing/womens-tops/?page=1"]
    page_index = 1

    def parse(self, response):
        products = response.xpath("//li[@data-url]")
        if products:
            for product in products:
                brand = product.xpath(
                    ".//span[contains(@class, 'qa-brandName')]/text()").extract()
                brand = brand[0].strip() if brand else 'N/A'
                title = product.xpath(
                    ".//span[contains(@class, 'qa-brandTitle')]/text()").extract()
                title = title[0].strip() if title else 'N/A'
                item = aqaqItem()
                item['brand'] = brand
                item['title'] = title
                yield item
        # if no products are available, simply return, which exits
        # parse and ultimately stops the spider
        else:
            return
        self.page_index += 1
        yield Request(url="http://www.jabong.com/women/clothing/womens-tops/?page=%s" % self.page_index,
                      callback=self.parse)
Even though the spider yields more than 12.5k products, there are a lot of duplicate entries among them; I have an ITEM_PIPELINE that removes the duplicate entries and inserts them into MongoDB.

Pipeline code:
from pymongo import MongoClient


class JabongPipeline(object):

    def __init__(self):
        self.db = MongoClient().jabong.product

    def isunique(self, data):
        return self.db.find(data).count() == 0

    def process_item(self, item, spider):
        if self.isunique(dict(item)):
            self.db.insert(dict(item))
        return item
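For this pipeline to run, it also has to be enabled in the Scrapy project settings. A minimal sketch, assuming the pipeline class above lives in a module importable as myproject.pipelines (that dotted path is hypothetical and depends on your own project layout):

# settings.py - hypothetical dotted path, adjust to your project layout
ITEM_PIPELINES = {
    'myproject.pipelines.JabongPipeline': 300,
}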
And here are the scrapy log stats:
2015-04-19 10:00:58+0530 [jabong] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 426231,
'downloader/request_count': 474,
'downloader/request_method_count/GET': 474,
'downloader/response_bytes': 3954822,
'downloader/response_count': 474,
'downloader/response_status_count/200': 235,
'downloader/response_status_count/301': 237,
'downloader/response_status_count/302': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 4, 19, 4, 30, 58, 710487),
'item_scraped_count': 12100,
'log_count/DEBUG': 12576,
'log_count/INFO': 11,
'request_depth_max': 234,
'response_received_count': 235,
'scheduler/dequeued': 474,
'scheduler/dequeued/memory': 474,
'scheduler/enqueued': 474,
'scheduler/enqueued/memory': 474,
'start_time': datetime.datetime(2015, 4, 19, 4, 26, 17, 867079)}
2015-04-19 10:00:58+0530 [jabong] INFO: Spider closed (finished)
Answer 2 (score: 0)
If you open the developer console on that page, you will see that the page content is returned by a web request:

http://www.jabong.com/home-living/furniture/new-products/?page=1

This returns an HTML document with all the items in it. So I would just keep incrementing the value of page and parsing it until the returned HTML is equal to the previously returned HTML.
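A minimal Scrapy sketch of that stopping condition, in the spirit of this answer (the spider class, its name, and the last_body attribute are illustrative assumptions rather than code from the answer; the imports follow the older Scrapy API used elsewhere in this thread):

from scrapy.spider import Spider
from scrapy.http import Request


class PageCompareSpider(Spider):
    name = "jabong_page_compare"
    allowed_domains = ["jabong.com"]
    start_urls = ["http://www.jabong.com/women/clothing/womens-tops/?page=1"]
    page = 1
    last_body = None  # body of the previously fetched page

    def parse(self, response):
        # Stop as soon as a page comes back identical to the previous one.
        if response.body == self.last_body:
            return
        self.last_body = response.body

        # ... extract and yield items from the page here ...

        self.page += 1
        yield Request("http://www.jabong.com/women/clothing/womens-tops/?page=%d" % self.page,
                      callback=self.parse)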
Answer 3 (score: 0)
Using dont_filter and issuing a new request every time will indeed run forever unless you get an error response.

Browse the infinite scroll in a browser and see what the response looks like when there are no more pages. Then handle that case in the spider by not issuing a new request.
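As a rough sketch of "handle that case by not issuing a new request", assuming the empty page simply contains no product li elements (which is what the other answers observed; verify the real empty response in the browser first, as this answer suggests):

from scrapy.spider import Spider
from scrapy.http import Request


class StopWhenEmptySpider(Spider):
    name = "jabong_stop_when_empty"
    allowed_domains = ["jabong.com"]
    start_urls = ["http://www.jabong.com/women/clothing/womens-tops/?page=1"]
    page = 1

    def parse(self, response):
        products = response.xpath("//li[@data-url]")
        if not products:
            # Empty page: do not yield another Request, so the scheduler
            # runs dry and Scrapy closes the spider on its own.
            return

        # ... extract and yield items from `products` here ...

        self.page += 1
        yield Request("http://www.jabong.com/women/clothing/womens-tops/?page=%d" % self.page,
                      headers={"X-Requested-With": "XMLHttpRequest"},
                      callback=self.parse)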
Answer 4 (score: -2)
$curl_handle = curl_init();
curl_setopt($curl_handle, CURLOPT_URL, 'http://www.jabong.com/women/clothing/womens-tops/?page=3');
// Pretend to be a regular browser
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0');
// The site serves the paginated content to XHR requests
curl_setopt($curl_handle, CURLOPT_HTTPHEADER, array('X-Requested-With: XMLHttpRequest'));
// Return the response body instead of printing it
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
$htmldata = curl_exec($curl_handle);
curl_close($curl_handle);
It works for me. Just make the call via PHP cURL.