I'm new to Python and I'm having some trouble following URLs with scrapy. I suspect the problem is with my XPath, but after working through several tutorials on the topic I'm no closer to solving it. The spider iterates over the URLs in the reference table, but it repeatedly scrapes content from the start page. What am I doing wrong?
Code attached:
import scrapy
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request

class MySpider(CrawlSpider):
    name = 'unespider'
    allowed_domains = ['https://my.une.edu.au/']
    start_urls = ['https://my.une.edu.au/courses/']

    rules = Rule(LinkExtractor(canonicalize=True, unique=True), follow=True, callback="parse"),

    def parse(self, response):
        hxs = Selector(response)
        for url in response.xpath('//*'):
            yield {
                'title': url.xpath('//*[@id="main-content"]/div/h2/a/text()').extract_first(),
                'avail': url.xpath('//*[@id="overviewTab-snapshotDiv"]/p[3]/a/text()').extract_first(),
            }
        for url in hxs.xpath('//tr/td/a/@href').extract():
            yield Request(response.urljoin(url), callback=self.parse)
Answer 0 (score: 0)
**Update: I see what you wanted and have updated the code; it now follows each year and outputs correctly.**
I'm sorry, I wasn't sure what you were trying to follow from the start page, or what `//*[@id="overviewTab-snapshotDiv"]` refers to; I couldn't find that XPath on the page. I'd like to help you more, but I'm still fairly new myself. Programming and scrapy are hard at first; I ended up writing my own scraper class so I could do things my own way, even though I'm sure scrapy is better :) I've finished the code that scrapes your titles and URLs. I commented out the rules because I don't know what you want to follow or why.
import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from overflowQuestion2 import items  # make sure to import items.py

class DavesimSpider(scrapy.Spider):
    name = 'daveSim'
    allowed_domains = ['my.une.edu.au']
    start_urls = ['http://my.une.edu.au/courses/2007', ]  # start at 2007

    # commented out: not sure what you want to follow or why
    # rules = Rule(LinkExtractor(canonicalize=True, unique=True), follow=True, callback="parse")

    def parse(self, response):  # this will scrape links and follow them
        # grab the main div wrapping the links
        divLinkWrapper = response.xpath('//div[@class="pagination"]')
        for links in divLinkWrapper:  # for every element, extract the links
            theLinks = links.xpath('ul/li/a/@href').extract()
            for i in theLinks:  # for every link, follow it
                yield scrapy.Request(i, callback=self.ContentParse)

    def ContentParse(self, response):  # scrape the content you want
        # grab the main div wrapping the content
        divMainContent = response.xpath('//div[@id="main-content"]')
        for titles in divMainContent:
            # create an Item object from items.py
            Item = items.Overflowquestion2Item()
            theTitles = titles.xpath('div[@class="content"]//a/text()').extract()
            # set Item to the scrapy.Field in items.py
            Item['title'] = theTitles
            yield Item  # yield Item through the pipeline
        for URLs in divMainContent:
            Item = items.Overflowquestion2Item()
            theURLs = URLs.xpath('//table/tr/td/a/@href').extract()
            Item['URL'] = theURLs
            yield Item
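One thing to watch out for: the `@href` values extracted above may be relative (e.g. `/courses/2008`), and `scrapy.Request` needs an absolute URL. Joining the link against the page it came from avoids that; a minimal standard-library sketch (the URLs here are illustrative, not taken from the live site):

```python
from urllib.parse import urljoin

# the page we are currently parsing (illustrative URL)
base = 'https://my.une.edu.au/courses/2007'

# a root-relative href joins against the base page's origin
print(urljoin(base, '/courses/2008'))
# -> https://my.une.edu.au/courses/2008

# an absolute href passes through unchanged
print(urljoin(base, 'https://my.une.edu.au/courses/2009'))
# -> https://my.une.edu.au/courses/2009
```

Inside a spider, `response.urljoin(url)` does the same thing, which is what the question's original code already used.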
Now for items.py:
import scrapy

class Overflowquestion2Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    URL = scrapy.Field()
Also remember to uncomment the item pipeline in settings.py:
ITEM_PIPELINES = {
    'overflowQuestion2.pipelines.Overflowquestion2Pipeline': 300,
}
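The pipeline class named in that setting isn't shown in the answer; as a sketch, `pipelines.py` would contain something like the pass-through class that `scrapy startproject` generates (the body below is an assumption, not the answerer's code):

```python
# pipelines.py (sketch: the do-nothing pass-through a new scrapy project starts with)
class Overflowquestion2Pipeline:
    def process_item(self, item, spider):
        # every item yielded by the spider arrives here one at a time;
        # returning it passes it along to the next pipeline (or the feed export)
        return item
```

Dropping or transforming items would happen here, but the pass-through is enough to satisfy the `ITEM_PIPELINES` setting above.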
I'm sure this could be coded better, and I hope someone here will improve it ;)