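I'm trying to scrape a page with Scrapy and XPath, but inside a for loop the query doesn't seem to match tags through a predicate.
# This package will contain the spiders of your Scrapy project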
from cunyfirst.items import CunyfirstSectionItem
import scrapy
import json


class CunyfristsectionSpider(scrapy.Spider):
    name = "cunyfirst-section-spider"
    start_urls = ["file:///Users/haowang/Desktop/section.htm"]

    def parse(self, response):
        url = response.url
        yield scrapy.Request(url, self.parse_page)

    def parse_page(self, response):
        n = -1
        for section in response.xpath("//a[contains(@name,'MTG_CLASS_NBR')]"):
            print(response.xpath("//a[@name ='MTG_CLASSNAME$10']/text()"))
            n += 1
            class_num = section.xpath('text()').extract_first()
            # print(class_num)
            classname = "MTG_CLASSNAME$" + str(n)
            date = "MTG_DAYTIME$" + str(n)
            instr = "MTG_INSTR$" + str(n)
            print(classname)
            class_name = response.xpath("//a[@name = classname]/text()")
I'm looking for the a elements whose name is "MTG_CLASSNAME$" + str(n), with n = 0, 1, 2, ..., but my XPath query returns empty output. I don't know why...
Thanks!
Answer 0 (score: 1)
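Hmm... I visited the site you put in your question description, opened the element inspector and searched for "MTG_CLASSNAME", and I got 0 matches...
So I'll give you some tools:
In your settings.py, set: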
LOG_FILE = "log.txt"
LOG_STDOUT = True
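Then print the response body (response.body) wherever you need it (in this case at the top of the parse_page function) and search log.txt
to check whether what you are looking for is actually there.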
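The same check can also be done in code instead of searching the log by hand. This is only a minimal sketch: count_marker is a made-up helper name and the sample bytes are a stand-in, not the real page. Inside parse_page the argument would be response.body, which Scrapy exposes as bytes.

```python
def count_marker(body: bytes, marker: str = "MTG_CLASSNAME") -> int:
    """Count how often `marker` occurs in the raw response body."""
    return body.count(marker.encode("utf-8"))


# Stand-in for response.body (hypothetical sample, not the real page).
sample_body = (
    b'<a name="MTG_CLASSNAME$0">BIO 101</a>'
    b'<a name="MTG_CLASSNAME$1">BIO 102</a>'
)
print(count_marker(sample_body))
```

A count of 0 would mean the anchors are not in the HTML that Scrapy received, so no XPath expression could match them.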
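Also, change for section in response.xpath("//a[contains(@name,'MTG_CLASS_NBR')]"):
to for section in response.xpath("//a[contains(@name,'MTG_CLASS_NBR')]").extract():
This will raise an error as soon as you get the data you want, because the loop then iterates over plain strings and section.xpath(...) is no longer available on them.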
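A side note on the query in the question, offered as a likely cause rather than a confirmed diagnosis: in "//a[@name = classname]/text()" the word classname sits inside the XPath string, so XPath reads it as a child element named classname, not as the Python variable. A minimal sketch of building the expression from the variable instead; name_xpath is a hypothetical helper, not part of Scrapy.

```python
def name_xpath(prefix: str, n: int) -> str:
    """Build an XPath that selects the text of <a name="PREFIX$n">."""
    # The value is quoted inside the expression, so XPath treats it
    # as a string literal rather than as an element name.
    return "//a[@name='{}${}']/text()".format(prefix, n)


# Print the expressions for the first three rows.
for n in range(3):
    print(name_xpath("MTG_CLASSNAME", n))
```

In the spider this would be used as response.xpath(name_xpath("MTG_CLASSNAME", n)).extract_first(). Scrapy selectors also accept XPath variables, for example response.xpath("//a[@name=$name]/text()", name=classname), which avoids quoting the value by hand.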