Question

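Good day. I'm currently writing a Scrapy program to scrape a news website. I'm a beginner with Scrapy and have hit a problem that is stopping me from getting any further with the code.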

The website I'm trying to scrape is https://www.thestar.com.my/news/nation

Inside the page's html there is a div tag with class="row list-listing". I'm trying to get the paragraph tags inside that div, but Scrapy can't seem to find the tag.

I've checked for unclosed tags, but they all appear to be closed. So why can't Scrapy get this tag? The innermost tag Scrapy can fetch is div class="sub-section-list", which sits outside div class="row list-listing".

Also, when I fetch the div class="sub-section-list" tag, it only extracts the following html:

<div class="sub-section-list">
    <div class="button-view btnLoadMore" style="margin: 10px auto 15px;">
        <a id="loadMorestories">Load more </a>
    </div>
</div>

When I inspect the website, these are the tags I need:

[image: Website Tag]

I'll include my basic code. I've only just started this project, so I haven't made any progress since running into this problem.

import scrapy


class WebCrawl(scrapy.Spider):
    name = "spooder"
    allowed_domains = ["thestar.com.my"]
    start_urls = ["https://www.thestar.com.my/news/nation"]

    def parse(self, response):
        text = response.xpath("//div[@class='sub-section-list']").extract()
        yield {
            'text' : text
        }

If I've forgotten to add anything else that's needed, please tell me. Any help would be greatly appreciated.

Answer 1

As Wim said, the page is loaded dynamically, so there are a few options. Using the Firefox developer tools, it looks like the content is being retrieved from:

https://cdn.thestar.com.my/Content/Data/parsely_data.json

So you can load the json directly and take what you need from it. Something like this:

import scrapy
import json

class WebCrawl(scrapy.Spider):
    name = "spooder"
    allowed_domains = ["thestar.com.my"]
    start_urls = ["https://cdn.thestar.com.my/Content/Data/parsely_data.json"]

    def parse(self, response):
        yield from json.loads(response.text)['data']

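Of course, this may not be exactly what you want, but maybe it's a good start?

(Note that the code above is overkill for what it does, but if you're getting started with scraping you can work from it.)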
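If you only want a few fields from each entry rather than the whole object, something like the sketch below might help. It assumes the entries under 'data' are dicts; the key names "title" and "url" are hypothetical placeholders (the answer doesn't show the real keys in parsely_data.json), so inspect the file and swap in whatever is actually there.

```python
import json


def pick_fields(payload, fields=("title", "url")):
    """Yield one dict per entry of payload['data'], keeping only `fields`.

    `fields` defaults to hypothetical key names; missing keys come back as None.
    """
    for entry in payload.get("data", []):
        if isinstance(entry, dict):
            yield {name: entry.get(name) for name in fields}


# Made-up sample payload, only to show the shape this sketch expects.
sample = json.loads(
    '{"data": [{"title": "A", "url": "/a", "extra": 1}, {"title": "B"}]}'
)
rows = list(pick_fields(sample))
```

Inside the spider, parse() could then do `yield from pick_fields(json.loads(response.text))` instead of yielding every entry untouched.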

Answer 2

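The content is loaded dynamically, so you can't reach it with xpath unless you render the page. It does look like the article body is present in the html, though, and you can get it like this: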

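import json

# Inside parse(self, response): grab the inline <script> that defines `listing`.
script = response.xpath(
    "//script[contains(text(), 'var listing = ')]/text()"
).extract_first()

# Slice out the object literal that follows 'var listing = '.
marker = 'var listing = '
first_index = script.index(marker) + len(marker)
# Look for the closing '};' only after the marker, not from the start of the script.
last_index = script.index('};', first_index) + 1
listings = json.loads(script[first_index:last_index])
articles = [article['article_body'] for article in listings['data']]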
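One thing to watch with the slicing above: searching for '};' can stop early if that text happens to appear inside one of the article strings. A possible alternative, sketched here, is to let the standard library's json.JSONDecoder.raw_decode find the end of the object itself. This assumes the literal assigned to `listing` is valid JSON (double-quoted keys and strings); the helper name extract_js_object and the sample script are made up for illustration.

```python
import json


def extract_js_object(script_text, marker="var listing = "):
    """Parse the JSON object literal that follows `marker` in a script body."""
    # str.index raises ValueError if the marker or an opening brace is missing.
    marker_pos = script_text.index(marker)
    start = script_text.index("{", marker_pos + len(marker))
    # raw_decode parses one JSON value starting at `start` and ignores
    # whatever JavaScript comes after it.
    obj, _end = json.JSONDecoder().raw_decode(script_text, start)
    return obj


# Made-up script text; note the '};' hidden inside the first article body.
sample_script = (
    'var other = {"x": 1}; '
    'var listing = {"data": [{"article_body": "Hello };"}, '
    '{"article_body": "World"}]}; init();'
)
listings = extract_js_object(sample_script)
bodies = [article["article_body"] for article in listings["data"]]
```

In the spider you would pass the string returned by the xpath query as script_text.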
