I am trying to scrape a web page with Scrapy and XPath. Here are my code and the log. Can anyone help me? Thanks in advance!
from scrapy import Spider
from scrapy.selector import Selector
from crawler.items import CrawlerItem


class CrawlerSpider(Spider):
    name = "crawler"
    allowed_domains = ["dayhoctienganh.net"]
    start_urls = [
        "https://dayhoctienganh.net/trac-nghiem-tieng-anh-trinh-do-b",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//ol[@class="questions"]/li')
        for question in questions:
            item = CrawlerItem()
            item['quest'] = question.xpath('/h3/text()').extract_first()
            item['sela'] = question.xpath('/ul[@class="answers"]/li[1]/label/text()').extract_first()
            item['selb'] = question.xpath('/ul[@class="answers"]/li[2]/label/text()').extract_first()
            item['selc'] = question.xpath('/ul[@class="answers"]/li[3]/label/text()').extract_first()
            item['seld'] = question.xpath('/ul[@class="answers"]/li[4]/label/text()').extract_first()
            item['key'] = question.xpath('/ul[@class="responses"]/li[2]/text()').extract_first()
            yield item
2019-11-16 23:53:53 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-11-16 23:53:53 [scrapy.core.engine] INFO: Spider opened
2019-11-16 23:53:53 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-11-16 23:53:53 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-11-16 23:53:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dayhoctienganh.net/trac-nghiem-tieng-anh-trinh-do-b> (referer: None)
2019-11-16 23:53:55 [scrapy.core.engine] INFO: Closing spider (finished)
Answer 0 (score: 0)
Does this mean it returns nothing? (I am also new to the Scrapy shell.)
I ran:
scrapy shell "https://dayhoctienganh.net/trac-nghiem-tieng-anh-trinh-do-b"
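Answer 1 (score: 0)
If you open the page source of the URL in start_urls (Ctrl/Cmd+U), you will not find the questions class. The questions list is therefore empty, the for loop in the parse method is skipped, and you do not get the results you want. The answers class is likewise absent from the page source, so all fields of item would be empty as well.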