DEBUG: Crawled (200) (referer: None)

Time: 2019-11-16 17:06:19

Tags: python xpath scrapy

I am trying to scrape a web page with Scrapy using XPath. Below are my code and the log. Can someone help me? Thanks in advance!

from scrapy import Spider
from scrapy.selector import Selector
from crawler.items import CrawlerItem

class CrawlerSpider(Spider):
    name = "crawler"
    allowed_domains = ["dayhoctienganh.net"]
    start_urls = [
        "https://dayhoctienganh.net/trac-nghiem-tieng-anh-trinh-do-b",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//ol[@class="questions"]/li')
        for question in questions:
            item = CrawlerItem()
            item['quest']= question.xpath('/h3/text()').extract_first()
            item['sela']= question.xpath('/ul[@class="answers"]/li[1]/label/text()').extract_first()
            item['selb']= question.xpath('/ul[@class="answers"]/li[2]/label/text()').extract_first()
            item['selc']= question.xpath('/ul[@class="answers"]/li[3]/label/text()').extract_first()
            item['seld']= question.xpath('/ul[@class="answers"]/li[4]/label/text()').extract_first()
            item['key']= question.xpath('/ul[@class="responses"]/li[2]/text()').extract_first()
            yield item

Log:

2019-11-16 23:53:53 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-11-16 23:53:53 [scrapy.core.engine] INFO: Spider opened
2019-11-16 23:53:53 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-11-16 23:53:53 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-11-16 23:53:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://dayhoctienganh.net/trac-nghiem-tieng-anh-trinh-do-b> (referer: None)
2019-11-16 23:53:55 [scrapy.core.engine] INFO: Closing spider (finished)

2 Answers:

Answer 0 (score: 0)

I tried the scrapy shell. Does this mean that it returns nothing? (I am also new to the scrapy shell.)

I ran:

scrapy shell "https://dayhoctienganh.net/trac-nghiem-tieng-anh-trinh-do-b"

and this is the HTML that I followed.

Answer 1 (score: 0)

If you open the source of the start_urls page with Ctrl/Cmd+U, you will not find the questions class anywhere, so the questions list will be empty. That causes the for loop in the parse method to be skipped, which is why you get no results.

The answers class is likewise absent from the page source, so all of the fields of item will be empty as well.