Hello, I'm new to Python and scraping. After researching and browsing Stack Overflow, I settled on Python + Selenium: open a WebDriver, load the URL, grab the page source, and extract the data I need from it. However, I know there are better approaches (e.g., not using Selenium, not having to scrape the page source, POSTing the data to the ASP service directly, etc.), so I'm hoping to get some help here for educational purposes.
Here is what I'm trying to achieve.
Before you get into my code, some background: ASOS is a site that uses pagination, so this also involves scraping across multiple pages. Separately, I tried, without Selenium, POSTing to http://www.asos.com/services/srvWebCategory.asmx/GetWebCategories with this data:
{'cid':'2623', 'strQuery':"", 'strValues':'undefined', 'currentPage':'0',
'pageSize':'204','pageSort':'-1','countryId':'10085','maxResultCount':''}
but I got nothing back.
I know my approach isn't great, and I would really appreciate any help/recommendations/approaches/ideas. Thanks!
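One thing worth checking on the POST attempt above: ASP.NET `.asmx` webservice methods typically expect a JSON request body with an explicit `Content-Type: application/json` header, and an empty response is a common symptom of sending the payload form-encoded instead. A minimal sketch with the standard library (the header requirement is an assumption about this endpoint; the request is built but not sent, so it doesn't depend on the live site):

```python
import json
from urllib import request

url = "http://www.asos.com/services/srvWebCategory.asmx/GetWebCategories"
payload = {
    "cid": "2623", "strQuery": "", "strValues": "undefined",
    "currentPage": "0", "pageSize": "204", "pageSort": "-1",
    "countryId": "10085", "maxResultCount": "",
}

# Serialize the payload as JSON and declare it in the Content-Type header;
# many .asmx endpoints return nothing for form-encoded bodies.
req = request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json; charset=utf-8"},
    method="POST",
)

# request.urlopen(req) would actually send it; omitted here so the
# sketch stands alone without hitting the live endpoint.
```

If the service responds, the JSON result is usually wrapped in a `d` key, which is another ASP.NET convention to look out for when parsing the reply.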
import scrapy
import time
import logging
from random import randint
from selenium import webdriver
from asos.items import ASOSItem
from scrapy.selector import Selector
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
class ASOSSpider(scrapy.Spider):
    name = "asos"
    allowed_domains = ["asos.com"]
    start_urls = [
        "http://www.asos.com/Women/New-In-Clothing/Cat/pgecategory.aspx?cid=2623#/parentID=-1&pge=0&pgeSize=204&sort="
    ]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        view_204 = self.driver.find_element_by_xpath("//div[@class='product-count-bottom']/a[@class='view-max-paged']")
        view_204.click()  # click to show 204 pictures
        time.sleep(5)  # wait till 204 images are loaded; I've also tried the explicit wait below, but it timed out
        # element = WebDriverWait(self.driver, 8).until(EC.presence_of_element_located((By.XPATH, "category-controls bottom")))
        logging.debug("wait time has been reached! go CRAWL!")
        next = self.driver.find_element_by_xpath("//li[@class='page-skip']/a")
        pageSource = Selector(text=self.driver.page_source)  # parse the page source instead; can't seem to crawl the page by just passing the regular request
        for sel in pageSource.xpath("//ul[@id='items']/li"):
            item = ASOSItem()
            item["product_title"] = sel.xpath("a[@class='desc']/text()").extract()
            item["product_link"] = sel.xpath("a[@class='desc']/@href").extract()
            item["product_price"] = sel.xpath("div/span[@class='price']/text()").extract()
            item["product_img"] = sel.xpath("div/a[@class='productImageLink']/img/@src").extract()
            yield item
        next.click()
        self.driver.close()