I have been working on a project with scrapy. With the help of this lovely community I have managed to scrape the first page of this website: http://www.rotoworld.com/playernews/nfl/football-player-news?ls=roto%3anfl%3agnav. I am also trying to scrape information from the "older" pages. I have researched "crawlspider", rules and link extractors, and believed I had the proper code. I want the spider to perform the same loop over the subsequent pages. Unfortunately, when I run it, it just spits out the first page and does not continue to the "older" pages.
I am not exactly sure what I need to change and would really appreciate some help. There are posts going all the way back to February 2004... I am new to data mining and not sure whether scraping every post is actually a realistic goal. If it is, I would like to do it, though. Any help is appreciated. Thanks!
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class Roto_News_Spider2(CrawlSpider):
    name = "RotoPlayerNews"

    start_urls = [
        'http://www.rotoworld.com/playernews/nfl/football/',
    ]

    Rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//input[@id="cp1_ctl00_btnNavigate1"]',)), callback="parse_page", follow=True),)

    def parse(self, response):
        for item in response.xpath("//div[@class='pb']"):
            player = item.xpath(".//div[@class='player']/a/text()").extract_first()
            position = item.xpath(".//div[@class='player']/text()").extract()[0].replace("-", "").strip()
            team = item.xpath(".//div[@class='player']/a/text()").extract()[1].strip()
            report = item.xpath(".//div[@class='report']/p/text()").extract_first()
            date = item.xpath(".//div[@class='date']/text()").extract_first() + " 2018"
            impact = item.xpath(".//div[@class='impact']/text()").extract_first().strip()
            source = item.xpath(".//div[@class='source']/a/text()").extract_first()
            yield {"Player": player, "Position": position, "Team": team, "Report": report, "Impact": impact, "Date": date, "Source": source}
Answer 0 (score: 0)
My suggestion: Selenium.
If you want to change pages automatically, you can use Selenium WebDriver. Selenium lets you interact with the page: click buttons, type into inputs, and so on. You would change your code to scrape the data and then click the "older" button; the page changes and the scraping continues.
Selenium is a very useful tool; I am using it right now on a personal project. You can take a look at my repo on GitHub to see how it works. For the page you are trying to scrape, you cannot reach the older posts just by changing the link being scraped, so you need Selenium to move between pages.
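For instance, a minimal sketch of that flow might look like the following (illustrative only, not tested against the live site; the button id is taken from the question's XPath, and chromedriver is assumed to be on your PATH):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("http://www.rotoworld.com/playernews/nfl/football/")
wait = WebDriverWait(driver, 10)

for _ in range(5):  # number of "older" pages to walk back; adjust as needed
    # scrape whatever you need from the current page
    items = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='pb']")))
    for item in items:
        print(item.find_element_by_xpath(".//div[@class='player']/a").text)
    # click the "older" button, then wait until the current items go stale,
    # i.e. the next page has replaced them
    wait.until(EC.element_to_be_clickable((By.ID, "cp1_ctl00_btnNavigate1"))).click()
    wait.until(EC.staleness_of(items[0]))

driver.quit()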
Hope it helps.
Answer 1 (score: 0)
There is no need to use Selenium in this case. Before scraping, open the URL in your browser, press F12 to inspect the page, and watch the requests in the Network tab. When you press next, or "OLDER" in your case, you can see a new set of TCP packets in the Network tab. It provides everything you need. Once you understand how it works, you can write a working spider.
import scrapy
from scrapy import FormRequest
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class Roto_News_Spider2(CrawlSpider):
    name = "RotoPlayerNews"

    start_urls = [
        'http://www.<DOMAIN>/playernews/nfl/football/',
    ]

    Rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//input[@id="cp1_ctl00_btnNavigate1"]',)), callback="parse", follow=True),)

    def parse(self, response):
        for item in response.xpath("//div[@class='pb']"):
            player = item.xpath(".//div[@class='player']/a/text()").extract_first()
            position = item.xpath(".//div[@class='player']/text()").extract()[0].replace("-", "").strip()
            team = item.xpath(".//div[@class='player']/a/text()").extract()[1].strip()
            report = item.xpath(".//div[@class='report']/p/text()").extract_first()
            date = item.xpath(".//div[@class='date']/text()").extract_first() + " 2018"
            impact = item.xpath(".//div[@class='impact']/text()").extract_first().strip()
            source = item.xpath(".//div[@class='source']/a/text()").extract_first()
            yield {"Player": player, "Position": position, "Team": team, "Report": report, "Impact": impact, "Date": date, "Source": source}

        # stop when the page no longer has an "OLDER" button
        older = response.css('input#cp1_ctl00_btnNavigate1')
        if not older:
            return

        # collect the hidden ASP.NET form fields plus the filter inputs,
        # so the server accepts the postback
        inputs = response.css('div.aspNetHidden input')
        inputs.extend(response.css('div.RW_pn input'))
        formdata = {}
        for input in inputs:
            name = input.css('::attr(name)').extract_first()
            value = input.css('::attr(value)').extract_first()
            formdata[name] = value or ''

        # emulate a click on the "OLDER" image button (x/y click coordinates)
        formdata['ctl00$cp1$ctl00$btnNavigate1.x'] = '42'
        formdata['ctl00$cp1$ctl00$btnNavigate1.y'] = '17'
        del formdata['ctl00$cp1$ctl00$btnFilterResults']
        del formdata['ctl00$cp1$ctl00$btnNavigate1']

        action_url = 'http://www.<DOMAIN>/playernews/nfl/football-player-news?ls=roto%3anfl%3agnav&rw=1'

        # POST the form back and parse the "older" page with the same callback
        yield FormRequest(
            action_url,
            formdata=formdata,
            callback=self.parse
        )
Be careful: you need to replace all the placeholder values (such as <DOMAIN>) in my code.
Answer 2 (score: 0)
There is no need to go for scrapy at all if you intend to fetch data spread across multiple pages. If you still want a scrapy-related solution, then I suggest you opt for splash to handle the pagination.
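For reference, a rough sketch of that splash route might look like the following (purely illustrative: it assumes the scrapy-splash package is installed and wired up in settings.py with a running Splash instance, the Lua script only clicks "OLDER" once, and the spider and script names are made up):

import scrapy
from scrapy_splash import SplashRequest  # requires scrapy-splash plus a running Splash instance

# Lua script run inside Splash: load the page, click "OLDER" once,
# wait for the postback to finish, and return the rendered HTML.
OLDER_PAGE_SCRIPT = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(1))
    splash:select('input#cp1_ctl00_btnNavigate1'):mouse_click()
    assert(splash:wait(2))
    return {html = splash:html()}
end
"""


class RotoSplashSpider(scrapy.Spider):
    name = "roto_splash"

    def start_requests(self):
        yield SplashRequest(
            'http://www.rotoworld.com/playernews/nfl/football/',
            callback=self.parse,
            endpoint='execute',
            args={'lua_source': OLDER_PAGE_SCRIPT},
        )

    def parse(self, response):
        # the returned {html = ...} table lets the usual selectors run on the rendered page
        for item in response.xpath("//div[@class='pb']"):
            yield {"Player": item.xpath(".//div[@class='player']/a/text()").extract_first()}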
I would do something like the following to get the items (assuming you have already installed selenium on your machine):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.rotoworld.com/playernews/nfl/football/")
wait = WebDriverWait(driver, 10)

while True:
    for item in wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='pb']"))):
        player = item.find_element_by_xpath(".//div[@class='player']/a").text
        player = player.encode()  # it should handle the encoding issue; I'm not totally sure, though
        print(player)

    try:
        idate = wait.until(EC.presence_of_element_located((By.XPATH, "//div[@class='date']"))).text
        if "Jun 9" in idate:  # put here any date you wanna go back to (last limit: where the scraper will stop)
            break
        wait.until(EC.presence_of_element_located((By.XPATH, "//input[@id='cp1_ctl00_btnNavigate1']"))).click()
        wait.until(EC.staleness_of(item))
    except:
        break

driver.quit()