I am trying to parse the product names on this Target search page using Scrapy and Splash. I send the request with Splash using yield SplashRequest(url=i, callback=self.parse, headers={"User-Agent": ua.chrome}), and then extract product_name in the parse function:
def parse(self, response):
    print("INSIDE PARSE TARGET")
    for product in response.xpath('//div[@data-test="productGridContainer"]/div[2]/ul/li//div[@data-test="product-card"]'):
        print("in PRODUCT")
        print(product)
        product_name = product.xpath('.//div[@data-test="productCardBody"]/div[@data-test="product-details"]/div[contains(@class,"ProductTitle")]/a[1]/@aria-label').extract_first()
        print("Product name: " + str(product_name))
        print("ratio: " + str(fuzz.partial_ratio(target_name.lower(), product_name.lower())))
        if fuzz.partial_ratio(target_name.lower(), product_name.lower()) > self.max_score:
            self.max_score = fuzz.partial_ratio(target_name.lower(), product_name.lower())
            self.product_page = product.xpath('.//div[@data-test="productCardBody"]/div[@data-test="product-details"]/div[contains(@class,"ProductTitle")]/a[1]/@href').extract_first()
            print("product_page: " + self.product_page)
        print("---------------------------------------")
    print("***********************************")
    print("max_score is: " + str(self.max_score))
    self.product_page = response.urljoin(self.product_page)
    print("FOUND PRODUCT AT PAGE: " + self.product_page)
    yield SplashRequest(url=self.product_page, callback=self.parseProduct, headers={"User-Agent": ua.chrome})
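
For context, the surrounding spider setup looks roughly like this (a trimmed sketch rather than my exact code; ua is fake_useragent's UserAgent and fuzz is fuzzywuzzy's fuzz module, since the imports aren't shown above, and the search term here is just an example):

import scrapy
from scrapy_splash import SplashRequest
from fake_useragent import UserAgent   # source of ua
from fuzzywuzzy import fuzz            # source of fuzz.partial_ratio

ua = UserAgent()
target_name = "google home"  # hypothetical search term

class TargetSpider(scrapy.Spider):
    name = "target"
    max_score = 0        # best fuzzy-match score seen so far
    product_page = None  # URL of the best-matching product card

    def start_requests(self):
        # Render each search page through Splash before parsing it
        for i in ["https://www.target.com/s?searchTerm=google+home+%2B"]:
            yield SplashRequest(url=i, callback=self.parse,
                                headers={"User-Agent": ua.chrome})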
However, this is the output I get. It never enters the for loop, and I don't understand why.
2018-08-01 14:08:04 [scrapy.core.engine] INFO: Spider opened
2018-08-01 14:08:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-08-01 14:08:04 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6044
2018-08-01 14:08:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.target.com/s?searchTerm=google+home+%2B via http://localhost:8050/render.html> (referer: None)
INSIDE PARSE TARGET
***********************************
max_score is: 0
FOUND PRODUCT AT PAGE: https://www.target.com/s?searchTerm=google+home+%2B
2018-08-01 14:08:07 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.target.com/s?searchTerm=google+home+%2B> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2018-08-01 14:08:07 [scrapy.core.engine] INFO: Closing spider (finished)
Answer 0 (score: 0)
There is no crawl loop happening in your spider, as this log line shows:
DEBUG: Filtered duplicate request: <GET https://www.target.com/s?searchTerm=google+home+%2B> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
You are trying to crawl the very page you have just crawled, and Scrapy's dupe filter is filtering that request out. It seems your self.product_page ends up holding the same URL you are already on, rather than a new one. I refactored your code a bit to try to make sense of the problem:
def parse(self, response):
    products = response.xpath('//div[@data-test="productGridContainer"]/div[2]/ul/li//div[@data-test="product-card"]')
    max_score = 0
    target_name = '???'
    product_page = None
    for product in products:
        name = product.xpath('.//div[@data-test="productCardBody"]/div[@data-test="product-details"]/div[contains(@class,"ProductTitle")]/a[1]/@aria-label').extract_first()
        url = product.xpath('.//div[@data-test="productCardBody"]/div[@data-test="product-details"]/div[contains(@class,"ProductTitle")]/a[1]/@href').extract_first()
        if response.urljoin(url) == response.url:
            continue  # avoid crawling the current page again
        ratio = fuzz.partial_ratio(target_name.lower(), name.lower())
        if ratio > max_score:  # compare against the local best score
            max_score = ratio
            product_page = url
    if product_page:
        print(f'max_score: {max_score}')
        print(f'product: {product_page}')
        yield SplashRequest(response.urljoin(product_page),
                            callback=self.parse_product,
                            headers={"User-Agent": ua.chrome})
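
One side note on the dupe filter itself: if you ever genuinely need to re-fetch a URL Scrapy has already seen, scrapy.Request accepts dont_filter=True, and SplashRequest forwards extra keyword arguments through to it. A minimal sketch (you should not need this here once product_page points at a new URL):

# Deliberately bypassing the duplicate-request filter (use sparingly)
yield SplashRequest(response.urljoin(product_page),
                    callback=self.parse_product,
                    headers={"User-Agent": ua.chrome},
                    dont_filter=True)  # skip Scrapy's dupe filter for this request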