XPath does not recognize the predicate on a tag

Date: 2018-06-01 22:00:39

Tags: xpath web-scraping scrapy

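I am trying to scrape a page with Scrapy and XPath, but when I run the query inside a for loop, the XPath predicate does not seem to match the tag:

# This package will contain the spiders of your Scrapy project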

from cunyfirst.items import CunyfirstSectionItem
import scrapy
import json

class CunyfristsectionSpider(scrapy.Spider):
    name = "cunyfirst-section-spider"
    start_urls = ["file:///Users/haowang/Desktop/section.htm"]

    def parse(self, response):
        url = response.url
        yield scrapy.Request(url, self.parse_page)

    def parse_page(self, response):
        n = -1
        for section in response.xpath("//a[contains(@name,'MTG_CLASS_NBR')]"):
            print(response.xpath("//a[@name ='MTG_CLASSNAME$10']/text()"))

            n += 1

            class_num = section.xpath('text()').extract_first()
            # print(class_num)
            classname = "MTG_CLASSNAME$" + str(n)
            date = "MTG_DAYTIME$" + str(n)
            instr = "MTG_INSTR$" + str(n)
            print(classname)

            class_name = response.xpath("//a[@name = classname]/text()")

I am looking for the elements whose name is "MTG_CLASSNAME$" + str(n), with n = 0, 1, 2, ..., but my XPath query returns empty output. I am not sure why.

P.S. I am basically trying to scrape courses and their information from https://hrsa.cunyfirst.cuny.edu/psc/cnyhcprd/GUEST/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL?FolderPath=PORTAL_ROOT_OBJECT.HC_CLASS_SEARCH_GBL&IsFolder=false&IgnoreParamTempl=FolderPath%252cIsFolder&PortalActualURL=https%3a%2f%2fhrsa.cunyfirst.cuny.edu%2fpsc%2fcnyhcprd%2fGUEST%2fHRMS%2fc%2fCOMMUNITY_ACCESS.CLASS_SEARCH.GBL&PortalContentURL=https%3a%2f%2fhrsa.cunyfirst.cuny.edu%2fpsc%2fcnyhcprd%2fGUEST%2fHRMS%2fc%2fCOMMUNITY_ACCESS.CLASS_SEARCH.GBL&PortalContentProvider=HRMS&PortalCRefLabel=Class%20Search&PortalRegistryName=GUEST&PortalServletURI=https%3a%2f%2fhome.cunyfirst.cuny.edu%2fpsp%2fcnyepprd%2f&PortalURI=https%3a%2f%2fhome.cunyfirst.cuny.edu%2fpsc%2fcnyepprd%2f&PortalHostNode=ENTP&NoCrumbs=yes with these filters applied: Kingsborough CC, 18, BIO

Thanks!

1 answer:

Answer 0: (score: 1)

Hmm... I visited the site you linked in the question, opened the element inspector, and searched for "MTG_CLASSNAME". I got 0 matches...

So here are a few tools:

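  • In your settings.py, set:

    LOG_FILE = "log.txt"

    LOG_STDOUT = True

    Then print the response body (response.body), in this case at the top of the parse_page function, and search for it in log.txt.

  • Check whether what you are looking for is actually there.

  • If it is, use https://www.freeformatter.com/xpath-tester.html (or a similar tool) to check your XPath expression.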
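
The first two tools above (dump the response body, then check whether the marker is in it) could be sketched roughly as follows. This is only an illustration: the helper name dump_and_check, the dump file name, and the sample HTML are made up for the example, and in a real spider you would pass response.body instead of the stand-in bytes.

```python
import os
import tempfile


def dump_and_check(body, marker, path):
    """Write the raw body bytes to `path` and report whether `marker` occurs in them."""
    with open(path, "wb") as f:
        f.write(body)
    return marker.encode("utf-8") in body


# Hypothetical stand-in for response.body (made-up sample markup).
sample_body = b'<html><body><a name="MTG_CLASSNAME$0">sample</a></body></html>'

# Hypothetical dump location; any writable path works.
dump_path = os.path.join(tempfile.gettempdir(), "section_dump.html")

found = dump_and_check(sample_body, "MTG_CLASSNAME", dump_path)
print(found)
```

If the check comes back False for the real response, the element is not in the HTML that Scrapy received, and no XPath expression will find it.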

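Also, change

    for section in response.xpath("//a[contains(@name,'MTG_CLASS_NBR')]"):

to

    for section in response.xpath("//a[contains(@name,'MTG_CLASS_NBR')]").extract():

This will raise an error once you get the data you want (.extract() returns plain strings, so the section.xpath(...) call in the loop body fails as soon as there is at least one match).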
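
Separately from the debugging steps above, one more thing may be worth checking in the question's code: the string "//a[@name = classname]/text()" contains the literal word classname, not the value of the Python variable, so XPath compares @name against a child element called classname and matches nothing. A minimal sketch of building the query from the variable instead, assuming the names follow the "MTG_CLASSNAME$" + str(n) pattern from the question (the helper name_xpath is hypothetical):

```python
def name_xpath(prefix, index):
    """Return an XPath for the text of the <a> whose name is prefix + str(index)."""
    return "//a[@name='{}{}']/text()".format(prefix, index)


# As written in the question: 'classname' is XPath syntax here, not the Python variable.
literal_query = "//a[@name = classname]/text()"

# Interpolated: the variable's value ends up inside the quoted XPath string.
query = name_xpath("MTG_CLASSNAME$", 0)
print(query)  # //a[@name='MTG_CLASSNAME$0']/text()

# Inside the spider this would be used along these lines (not executed here):
#   class_name = response.xpath(name_xpath("MTG_CLASSNAME$", n)).extract_first()
# Scrapy selectors also accept XPath variables as keyword arguments:
#   response.xpath("//a[@name=$name]/text()", name=classname)
```

The XPath-variable form avoids quoting problems if a name ever contains a quote character. Neither form helps if the element is missing from the response in the first place.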