Question

I'm new to Scrapy and Python. I'm trying to scrape the comments from the following URL, but the result is always null: http://vnexpress.net/tin-tuc/oto-xe-may/toyota-camry-2016-dinh-loi-tui-khi-khong-bung-3386676.html

Here is my code:

from scrapy.spiders import Spider
from scrapy.selector import Selector
from tutorial.items import TutorialItem

import logging

class TutorialSpider(Spider):
    name = "vnexpress"
    allowed_domains = ["vnexpress.net"]
    start_urls = [
        "http://vnexpress.net/tin-tuc/oto-xe-may/toyota-camry-2016-dinh-loi-tui-khi-khong-bung-3386676.html"
    ]

    def parse(self, response):
        sel = Selector(response)
        commentList = sel.xpath('//div[@class="comment_item"]')
        items = []
        id = 0

        logging.log(logging.INFO, "TOTAL COMMENT : " + str(len(commentList)))

        for comment in commentList:
            item = TutorialItem()

            id = id + 1

            item['id'] = id
            item['mainId'] = 0
            # Start each XPath with ".//" so it is evaluated relative to this
            # comment node rather than against the whole document.
            item['user'] = comment.xpath('.//span[@class="left txt_666 txt_11"]/b').extract()
            item['time'] = 'N/A'
            item['content'] = comment.xpath('.//p[@class="full_content"]').extract()
            item['like'] = comment.xpath('.//span[@class="txt_666 txt_11 right block_like_web"]/a[@class="txt_666 txt_11 total_like"]').extract()

            items.append(item)

        return items

Thanks for reading.

Answer 1

It looks like the comments are loaded into the page by some JavaScript code.

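Scrapy does not execute JavaScript on the page; it only downloads the HTML. Try opening the page in your browser with JavaScript disabled, and you should see the page as Scrapy sees it.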
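A quick way to check this without a browser is to fetch the raw HTML and look for the comment markup. This is only a minimal sketch using the standard library: `has_comment_markup` and `fetch_raw_html` are hypothetical helper names, and the marker string is simply taken from the XPath in the question. If the check comes back False for the downloaded page, the comments are most likely injected later by JavaScript.

```python
from urllib.request import Request, urlopen

# Marker taken from the XPath in the question: //div[@class="comment_item"]
COMMENT_MARKER = 'class="comment_item"'


def has_comment_markup(html, marker=COMMENT_MARKER):
    """Return True if the raw HTML already contains the comment markup."""
    return marker in html


def fetch_raw_html(url, timeout=10):
    """Download the page as a client that does not run JavaScript sees it."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage (performs a network request, so it is left commented out):
# html = fetch_raw_html("http://vnexpress.net/tin-tuc/oto-xe-may/"
#                       "toyota-camry-2016-dinh-loi-tui-khi-khong-bung-3386676.html")
# print(has_comment_markup(html))
```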

You have a few options:

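- reverse-engineer how the comments are loaded, using the "Network" tab of your browser's developer tools panel (it is probably some XHR call loading HTML or JSON data);
- use a (headless) browser to render the page (selenium, casper.js, splash...);
  - e.g. you may want to try this page with Splash (one of the JavaScript rendering options for web scraping). This is the HTML you get back from Splash (it contains the comments): http://pastebin.com/njgCsM9w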
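As a rough sketch of the Splash route: Splash exposes an HTTP endpoint, `render.html`, that returns the page's HTML after its JavaScript has run. The code below only builds that URL and fetches it with the standard library. It assumes a Splash instance is listening on `localhost:8050` (its default port); the 2-second wait and the helper names are placeholders, not tuned values.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a Splash instance is running locally on its default port.
SPLASH_RENDER_ENDPOINT = "http://localhost:8050/render.html"


def splash_render_url(page_url, wait=2.0, endpoint=SPLASH_RENDER_ENDPOINT):
    """Build a render.html URL asking Splash to load page_url, then wait."""
    query = urlencode({"url": page_url, "wait": wait})
    return endpoint + "?" + query


def fetch_rendered_html(page_url, wait=2.0, timeout=60):
    """Return the HTML Splash produced after running the page's JavaScript."""
    with urlopen(splash_render_url(page_url, wait), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

The returned HTML can then be passed to Scrapy's `Selector(text=html)` and queried with the same XPath expressions as in the question. Inside a Scrapy project, the scrapy-splash package is probably the more convenient way to do the same thing, since it keeps the request inside the normal crawl flow.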
