Scrapy crawling a set of links that may contain next pages

Date: 2018-02-04 13:45:13

Tags: python selenium-webdriver web-scraping scrapy

I want to:

  1. Extract the links from a specific page
  2. For each link, scrape some of its content, as well as the content of its 'next pages'
  3. Then export the results to a JSON file (not important as far as this question is concerned)
  4. Currently my spider looks like this:

    class mySpider(scrapy.Spider):
         ...
        def parse(self, response):
            for url in someurls:
                yield scrapy.Request(url=url, callback=self.parse_next)
    
        def parse_next(self, response):
            for selector in someselectors:
                yield { 'contents':...,
                         ...}
            nextPage = obtainNextPage()
            if nextPage:
                yield scrapy.Request(url=nextPage, callback=self.parse_next)
    

    The problem is that, for the set of links the spider processes, it only reaches the 'next page' of the last link in that set (I checked this via selenium + chromedriver). For example, if I have 10 links (No.1 through No.10), my spider only gets the next pages of the No.10 link. I don't know whether this happens because of a structural problem in my spider. Below is the full code:

    import scrapy
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    import time
    
    
    class BaiduSpider(scrapy.Spider):
        name = 'baidu'
        allowed_domains = ['baidu.com']
        start_urls = ['http://tieba.baidu.com']
        main_url = 'http://tieba.baidu.com/f?kw=%E5%B4%94%E6%B0%B8%E5%85%83&ie=utf-8'
        username = ""
        password = ""
    
        def __init__(self, username=username, password=password):
            #options = webdriver.ChromeOptions()
            #options.add_argument('headless')
            #options.add_argument('window-size=1200x600')
            self.driver = webdriver.Chrome()#chrome_options=options)
            self.username = username
            self.password = password
        # checked
        def logIn(self):
            elem = self.driver.find_element_by_css_selector('#com_userbar > ul > li.u_login > div > a')
            elem.click()
            wait = WebDriverWait(self.driver,10).until(EC.presence_of_element_located((By.CSS_SELECTOR,'#TANGRAM__PSP_10__footerULoginBtn')))
            elem = self.driver.find_element_by_css_selector('#TANGRAM__PSP_10__footerULoginBtn')
            elem.click()
            elem = self.driver.find_element_by_css_selector('#TANGRAM__PSP_10__userName')
            elem.send_keys(self.username)
            elem = self.driver.find_element_by_css_selector('#TANGRAM__PSP_10__password')
            elem.send_keys(self.password)
            self.driver.find_element_by_css_selector('#TANGRAM__PSP_10__submit').click()
        # basic checked
        def parse(self, response):
            self.driver.get(response.url)
            self.logIn()
            # wait for manual entry of the verification code
            time.sleep(15)
            self.driver.get('http://tieba.baidu.com/f?kw=%E5%B4%94%E6%B0%B8%E5%85%83&ie=utf-8')
            for url in self.driver.find_elements_by_css_selector('a.j_th_tit')[:2]:
                #new_url = response.urljoin(url)
                new_url = url.get_attribute("href")
                yield scrapy.Request(url=new_url, callback=self.parse_next)
        # checked
        def pageScroll(self, url):
            self.driver.get(url)
            SCROLL_PAUSE_TIME = 0.5
            SCROLL_LENGTH = 1200
            page_height = int(self.driver.execute_script("return document.body.scrollHeight"))
            scrollPosition = 0
            while scrollPosition < page_height:
                scrollPosition = scrollPosition + SCROLL_LENGTH
                self.driver.execute_script("window.scrollTo(0, " + str(scrollPosition) + ");")
                time.sleep(SCROLL_PAUSE_TIME)
            time.sleep(1.2)
    
        def parse_next(self, response):
            self.log('I visited ' + response.url)
            self.pageScroll(response.url)
    
            for sel in self.driver.find_elements_by_css_selector('div.l_post.j_l_post.l_post_bright'):
                name = sel.find_element_by_css_selector('.d_name').text
                try:
                    content = sel.find_element_by_css_selector('.j_d_post_content').text
                except: content = ''
    
                try: reply = sel.find_element_by_css_selector('ul.j_lzl_m_w').text
                except: reply = ''
                yield {'name': name, 'content': content, 'reply': reply}
    
            #follow to next page
    
            next_sel = self.driver.find_element_by_link_text("下一页")
            next_url_name = next_sel.text
    
            if next_sel and next_url_name == '下一页':
                next_url = next_sel.get_attribute('href')
    
                yield scrapy.Request(url=next_url, callback=self.parse_next)
    

    Thanks for your help, and any suggestions regarding the code above are welcome.

1 Answer:

Answer 0 (score: 1):

Regarding scraping content from one page, storing it, and then having the spider continue crawling to scrape and store items from subsequent pages: you should configure your items.py file with the item field names, and pass the item along to each scrapy.Request using meta.

You should check out https://github.com/scrapy/scrapy/issues/1138

To illustrate how this works, it goes something like this: first, we set up the items.py file so it can hold everything we scrape from each page.

#items.py
import scrapy

class ScrapyProjectItem(scrapy.Item):
    page_one_item = scrapy.Field()
    page_two_item = scrapy.Field()
    page_three_item = scrapy.Field()

Then import that item class from items.py into your scrapy spider:

from scrapyproject.items import ScrapyProjectItem

In the spider, on each page iteration that has the content you want, you initialize the item class and then pass the item along to the next request using 'meta'.

#spider.py
def parse(self, response):
    # Initializing the item class
    item = ScrapyProjectItem()
    # Itemizing the... item lol
    item['page_one_item'] = response.css("etcetc::").extract() # set desired attribute
    # Here we pass the items to the next concurrent request
    for url in someurls: # There's a million ways to skin a cat, don't know your exact use case.
        yield scrapy.Request(response.urljoin(url),
                             callback=self.parse_next, meta={'item': item})

def parse_next(self, response):
    # We load the meta from the previous request
    item = response.meta['item']
    # We itemize
    item['page_two_item'] = response.css("etcetc::").extract()
    # We pass meta again to next request
    for url in someurls:
        yield scrapy.Request(response.urljoin(url),
                             callback=self.parse_again, meta={'item': item})

def parse_again(self, response):
    # We load the meta from the previous request
    item = response.meta['item']
    # We itemize
    item['page_three_item'] = response.css("etcetc::").extract()
    # We pass meta again to next request
    for url in someurls:
        yield scrapy.Request(response.urljoin(url),
                             callback=self.parse_again, meta={'item': item})
    # At the end of each iteration of the crawl loop we can yield the result
    yield item

Regarding the problem of the crawler only reaching the last link, I'd like some more information rather than guessing where the problem might be. In your 'parse_next' you should add a 'print(response.url)' to see whether the page is being reached at all. Sorry if I'm not understanding your problem and wasted everyone's time, lol.
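As a minimal debugging sketch, reusing the names from your spider above (the print lines are the only additions, everything else stays as you already have it):

def parse_next(self, response):
    print('parse_next received:', response.url)    # which page did this callback actually get?
    self.pageScroll(response.url)
    # ... scrape the posts exactly as before ...

    next_sel = self.driver.find_element_by_link_text("下一页")
    next_url = next_sel.get_attribute('href')
    print('next page for', response.url, 'is', next_url)   # does every thread produce its own next-page request?
    yield scrapy.Request(url=next_url, callback=self.parse_next)

If only one of the ten threads ever prints a next-page line, that narrows down where the links are being lost.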

EDIT

I think I understand your problem better now... you have a list of URLs, and each of those URLs has its own set of URLs, yes?

In your code, 'obtainNextPage()' might be the problem? I've run into this in the past and had to use some xpath and/or regex magic to properly grab the next page. I'm not sure what 'obtainNextPage' is doing, but... have you considered parsing the content and using a selector to find the next page? For example:

class mySpider(scrapy.Spider):
     ...
    def parse(self, response):
        for url in someurls:
            yield scrapy.Request(url=url, callback=self.parse_next)

    def parse_next(self, response):
        for selector in someselectors:
            yield { 'contents':...,
                     ...}
        #nextPage = obtainNextPage()
        next_page = response.xpath('//path/to/nextbutton/@href').extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page),
                                 callback=self.parse_next)

You should still add a 'print(response.url)' to see whether the requested URL is being called correctly; it might be a urljoin problem.
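On the urljoin point: response.urljoin resolves the given URL against the page's own URL, so a relative href becomes absolute and an already-absolute href comes back unchanged. Printing its input and output is a quick way to rule it out. A rough illustration (the URLs here are made up for the example):

# Quick illustration of urljoin behaviour (example URLs are invented);
# response.urljoin(url) behaves essentially like urljoin(response.url, url).
from urllib.parse import urljoin

base = 'http://tieba.baidu.com/p/123456'
print(urljoin(base, '/p/123456?pn=2'))           # -> 'http://tieba.baidu.com/p/123456?pn=2'
print(urljoin(base, 'http://example.com/next'))  # absolute href is returned unchanged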