Scraping "older" pages using Scrapy, Rules and LinkExtractor

Time: 2018-06-10 19:29:48

Tags: web-scraping scrapy rules

I have been working on a project with Scrapy. With the help of this lovely community I have managed to scrape the first page of this website: http://www.rotoworld.com/playernews/nfl/football-player-news?ls=roto%3anfl%3agnav. I am also trying to scrape information from the "older" pages. I have researched CrawlSpider, Rules and LinkExtractor, and believe I have the code right. I want the spider to perform the same loop over the subsequent pages. Unfortunately, when I run it, it only spits out the first page and does not continue to the "older" pages.

I am not exactly sure what I need to change and would really appreciate some help. There are posts going all the way back to February 2004... I am new to data mining and am not sure whether scraping every post is actually a realistic goal. If it is, though, I would like to do it. Any help is appreciated. Thanks!

import scrapy
from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors import LinkExtractor



class Roto_News_Spider2(CrawlSpider):
    name = "RotoPlayerNews"

    start_urls = [
        'http://www.rotoworld.com/playernews/nfl/football/',
    ]

    Rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//input[@id="cp1_ctl00_btnNavigate1"]',)), callback="parse_page", follow= True),)


    def parse(self, response):
        for item in response.xpath("//div[@class='pb']"):
            player = item.xpath(".//div[@class='player']/a/text()").extract_first()
            position= item.xpath(".//div[@class='player']/text()").extract()[0].replace("-","").strip()
            team = item.xpath(".//div[@class='player']/a/text()").extract()[1].strip()
            report = item.xpath(".//div[@class='report']/p/text()").extract_first()
            date = item.xpath(".//div[@class='date']/text()").extract_first() + " 2018"
            impact = item.xpath(".//div[@class='impact']/text()").extract_first().strip()
            source = item.xpath(".//div[@class='source']/a/text()").extract_first()
            yield {"Player": player,"Position": position, "Team": team,"Report":report,"Impact":impact,"Date":date,"Source":source}

3 Answers:

Answer 0 (score: 0)

My suggestion: Selenium

If you want to change pages automatically, you can use Selenium WebDriver. Selenium lets you interact with the page: click buttons, type into inputs, and so on. You would need to change your code so that it scrapes the data and then clicks the "Older" button. The page then changes and the scraping continues.

Selenium is a very useful tool; I am using it right now on a personal project. You can take a look at my repo on GitHub to see how it works. For the page you are trying to scrape, you cannot reach the older entries just by changing the link that is scraped, so you need Selenium to switch between pages.
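As a rough illustration of that click-then-scrape loop (a sketch only: it assumes chromedriver is available and reuses the button id from the question's XPath; answer 2 below has a fuller version with explicit waits):

from selenium import webdriver
import time

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get("http://www.rotoworld.com/playernews/nfl/football/")

for _ in range(3):  # demo: first three pages; extend the loop as needed
    for item in driver.find_elements_by_xpath("//div[@class='pb']"):
        print(item.find_element_by_xpath(".//div[@class='player']/a").text)
    # clicking "Older" triggers the ASP.NET postback that loads the next page
    driver.find_element_by_id("cp1_ctl00_btnNavigate1").click()
    time.sleep(2)  # crude wait; explicit waits (see answer 2) are more robust

driver.quit()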

Hope it helps.

Answer 1 (score: 0)

No need to use Selenium in this case. Before scraping, open the URL in your browser, press F12 to inspect the page, and watch the traffic in the Network tab. When you press Next or "OLDER" on this page, you will see a new set of requests appear in the Network tab. They give you everything you need. Once you understand how it works, you can write a working spider.

import scrapy
from scrapy import FormRequest
from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors import LinkExtractor



class Roto_News_Spider2(CrawlSpider):
    name = "RotoPlayerNews"

    start_urls = [
        'http://www.<DOMAIN>/playernews/nfl/football/',
    ]

    Rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//input[@id="cp1_ctl00_btnNavigate1"]',)), callback="parse", follow= True),)


    def parse(self, response):
        for item in response.xpath("//div[@class='pb']"):
            player = item.xpath(".//div[@class='player']/a/text()").extract_first()
            position= item.xpath(".//div[@class='player']/text()").extract()[0].replace("-","").strip()
            team = item.xpath(".//div[@class='player']/a/text()").extract()[1].strip()
            report = item.xpath(".//div[@class='report']/p/text()").extract_first()
            date = item.xpath(".//div[@class='date']/text()").extract_first() + " 2018"
            impact = item.xpath(".//div[@class='impact']/text()").extract_first().strip()
            source = item.xpath(".//div[@class='source']/a/text()").extract_first()
            yield {"Player": player,"Position": position, "Team": team,"Report":report,"Impact":impact,"Date":date,"Source":source}

        # the "Older" button; if it is missing, this is the last page
        older = response.css('input#cp1_ctl00_btnNavigate1')
        if not older:
            return

        # collect the hidden ASP.NET form fields (__VIEWSTATE, __EVENTVALIDATION, filters, ...)
        inputs = response.css('div.aspNetHidden input')
        inputs.extend(response.css('div.RW_pn input'))

        formdata = {}
        for input in inputs:
            name = input.css('::attr(name)').extract_first()
            value = input.css('::attr(value)').extract_first()
            formdata[name] = value or ''

        # simulate a click on the "Older" image button (.x/.y are the click coordinates)
        formdata['ctl00$cp1$ctl00$btnNavigate1.x'] = '42'
        formdata['ctl00$cp1$ctl00$btnNavigate1.y'] = '17'
        # drop the buttons that should not be "pressed" in this postback
        del formdata['ctl00$cp1$ctl00$btnFilterResults']
        del formdata['ctl00$cp1$ctl00$btnNavigate1']

        action_url = 'http://www.<DOMAIN>/playernews/nfl/football-player-news?ls=roto%3anfl%3agnav&rw=1'

        # POST the form back and parse the resulting "older" page with the same callback
        yield FormRequest(
            action_url,
            formdata=formdata,
            callback=self.parse
        )

Be careful: you need to replace all the <DOMAIN> placeholders in my code.
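In case it is useful, one way to run a spider like this outside a full Scrapy project and dump the items to a file (a sketch; it assumes the spider class above is in the same file, and uses the feed settings available at the time of the answer):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',          # export the yielded dicts as JSON
    'FEED_URI': 'player_news.json', # output file name is just an example
})
process.crawl(Roto_News_Spider2)
process.start()  # blocks until the crawl finishes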

Answer 2 (score: 0)

There is no need to go for Scrapy if your intention is just to fetch data spread across multiple pages. If you still want a Scrapy-related solution, then I suggest you go for Splash to handle the pagination.
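For what it is worth, a rough sketch of that Splash route (assumptions: a Splash instance is running and scrapy-splash is installed with its middlewares configured in settings.py; the spider name and Lua script are illustrative, and the script only demonstrates a single "Older" click):

import scrapy
from scrapy_splash import SplashRequest

# Lua script executed inside Splash: load the page, click "Older" once, return the rendered HTML
lua_click_older = """
function main(splash)
    assert(splash:go(splash.args.url))
    assert(splash:wait(1))
    local older = splash:select('input#cp1_ctl00_btnNavigate1')
    if older then
        older:mouse_click()
        assert(splash:wait(2))
    end
    return splash:html()
end
"""

class RotoSplashSketch(scrapy.Spider):
    name = "roto_splash_sketch"  # illustrative name

    def start_requests(self):
        yield SplashRequest(
            'http://www.rotoworld.com/playernews/nfl/football/',
            callback=self.parse,
            endpoint='execute',
            args={'lua_source': lua_click_older},
        )

    def parse(self, response):
        # the same item extraction as in the question would go here
        for item in response.xpath("//div[@class='pb']"):
            yield {"Player": item.xpath(".//div[@class='player']/a/text()").extract_first()}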

I would do something like below to get the items (assuming you have already installed selenium on your machine):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.rotoworld.com/playernews/nfl/football/")
wait = WebDriverWait(driver, 10)

while True:
    for item in wait.until(EC.presence_of_all_elements_located((By.XPATH,"//div[@class='pb']"))):
        player = item.find_element_by_xpath(".//div[@class='player']/a").text
        player = player.encode() #it should handle the encoding issue; I'm not totally sure, though
        print(player)

    try:
        idate = wait.until(EC.presence_of_element_located((By.XPATH, "//div[@class='date']"))).text
        if "Jun 9" in idate: #put here any date you wanna go back to (last limit: where the scraper will stop)
            break
        # click "Older" and wait until the current items go stale, i.e. the page has refreshed
        wait.until(EC.presence_of_element_located((By.XPATH, "//input[@id='cp1_ctl00_btnNavigate1']"))).click()
        wait.until(EC.staleness_of(item))
    except:break

driver.quit()