I'm new to Python and I'm having some trouble following URLs with scrapy. I suspect the problem is with my XPath, but after working through several tutorials on the topic I'm no closer to solving it. The spider iterates over the URLs in the reference table, but it repeatedly scrapes content from the start page. What am I doing wrong?
Code attached:
import scrapy
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request

class MySpider(CrawlSpider):
    name = 'unespider'
    allowed_domains = ['https://my.une.edu.au/']
    start_urls = ['https://my.une.edu.au/courses/']

    rules = Rule(LinkExtractor(canonicalize=True, unique=True), follow=True, callback="parse"),

    def parse(self, response):
        hxs = Selector(response)
        for url in response.xpath('//*'):
            yield {
                'title': url.xpath('//*[@id="main-content"]/div/h2/a/text()').extract_first(),
                'avail': url.xpath('//*[@id="overviewTab-snapshotDiv"]/p[3]/a/text()').extract_first(),
            }
        for url in hxs.xpath('//tr/td/a/@href').extract():
            yield Request(response.urljoin(url), callback=self.parse)
Answer 0 (score: 0)
**Update: I see what you wanted and have updated the code; it now follows each year and outputs correctly.**
I'm sorry, I wasn't sure what you were trying to follow from the start page, or what `//*[@id="overviewTab-snapshotDiv"]` refers to; I couldn't find that XPath on the page. I'd like to help you more, but I'm still fairly new myself. Programming and scrapy are hard at first; I ended up writing my own scraper class so I could do things my own way, even though I'm sure scrapy is better :) I've finished the code that scrapes your titles and URLs. I commented out the rules because I don't know what you want to follow or why.
import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from overflowQuestion2 import items  # make sure to import items.py

class DavesimSpider(scrapy.Spider):
    name = 'daveSim'
    allowed_domains = ['my.une.edu.au']
    start_urls = ['http://my.une.edu.au/courses/2007', ]  # start at 2007

    # commented out: not sure what you want to follow or why
    # rules = Rule(LinkExtractor(canonicalize=True, unique=True), follow=True, callback="parse")

    def parse(self, response):  # this will scrape links and follow them
        # grab the main div wrapping the links
        divLinkWrapper = response.xpath('//div[@class="pagination"]')
        for links in divLinkWrapper:  # for every element, extract the links
            theLinks = links.xpath('ul/li/a/@href').extract()
            for i in theLinks:  # for every link, follow it
                yield scrapy.Request(i, callback=self.ContentParse)

    def ContentParse(self, response):  # scrape the content you want
        # grab the main div wrapping the content
        divMainContent = response.xpath('//div[@id="main-content"]')
        for titles in divMainContent:
            # create an Item object from items.py
            Item = items.Overflowquestion2Item()
            theTitles = titles.xpath('div[@class="content"]//a/text()').extract()
            # set Item to the scrapy.Field in items.py
            Item['title'] = theTitles
            yield Item  # yield Item through the pipeline
        for URLs in divMainContent:
            Item = items.Overflowquestion2Item()
            theURLs = URLs.xpath('//table/tr/td/a/@href').extract()
            Item['URL'] = theURLs
            yield Item
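One thing to watch out for: the `@href` values extracted above may be relative (e.g. `/courses/2008`), and `scrapy.Request` needs an absolute URL. Joining the link against the page it came from avoids that; a minimal standard-library sketch (the URLs here are illustrative, not taken from the live site):

```python
from urllib.parse import urljoin

# the page we are currently parsing (illustrative URL)
base = 'https://my.une.edu.au/courses/2007'

# a root-relative href joins against the base page's origin
print(urljoin(base, '/courses/2008'))
# -> https://my.une.edu.au/courses/2008

# an absolute href passes through unchanged
print(urljoin(base, 'https://my.une.edu.au/courses/2009'))
# -> https://my.une.edu.au/courses/2009
```

Inside a spider, `response.urljoin(url)` does the same thing, which is what the question's original code already used.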
Now for items.py:
import scrapy

class Overflowquestion2Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    URL = scrapy.Field()
Also remember to uncomment the item pipeline in settings.py:
ITEM_PIPELINES = {
    'overflowQuestion2.pipelines.Overflowquestion2Pipeline': 300,
}
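The pipeline class named in that setting isn't shown in the answer; as a sketch, `pipelines.py` would contain something like the pass-through class that `scrapy startproject` generates (the body below is an assumption, not the answerer's code):

```python
# pipelines.py (sketch: the do-nothing pass-through a new scrapy project starts with)
class Overflowquestion2Pipeline:
    def process_item(self, item, spider):
        # every item yielded by the spider arrives here one at a time;
        # returning it passes it along to the next pipeline (or the feed export)
        return item
```

Dropping or transforming items would happen here, but the pass-through is enough to satisfy the `ITEM_PIPELINES` setting above.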
I'm sure this could be coded better, and I hope someone here will improve it ;)