Scraping Dawn news website returns (referer: None)

Date: 2017-10-23 11:20:01

Tags: python python-3.x scrapy web-crawler

My scraping code returns (referer: None) for this news website. The code is below. I tried the same code for BBC and it worked fine, but for this website it does not return the desired results.

import os
import scrapy


newpath = 'urdu_data' 
if not os.path.exists(newpath):
    os.makedirs(newpath)


class UrduSpider(scrapy.Spider):
    name = "urdu"
    start_urls = [
        'https://www.dawnnews.tv',
        'https://www.dawnnews.tv/latest-news'
        'https://www.dawnnews.tv/news'
        'https://www.dawnnews.tv/tech'
    ]

    def should_process_page(self, page_url):
        # Only process pages under one of the start URLs,
        # not the start URLs themselves
        for s_url in self.start_urls:
            if page_url.startswith(s_url) and page_url != s_url:
                return True

        return False

    def parse(self, response):

        if self.should_process_page(response.url):
            page_id = response.url.split("/")[-1]
            filename = page_id + '.txt'

            # if the response has a story body, save its contents
            story_body = response.css('div.story__content')
            story_paragraphs_text = story_body.css('p::text')
            page_data = ''
            for p in story_paragraphs_text:
                page_data += p.extract() + '\n'

            if page_data:
                open('urdu_data/' + filename, 'w').write(page_data)

            # Now follow any links that are present on the page
            links = response.css('a.title-link ::attr(href)').extract()
            for link in links:
                yield scrapy.Request(
                    response.urljoin(link),
                    callback=self.parse
                )

1 Answer:

Answer 0 (score: 0)

I think you need start_urls like the following:

start_urls = [
        'https://www.dawnnews.tv',
        'https://www.dawnnews.tv/latest-news',
        'https://www.dawnnews.tv/news',
        'https://www.dawnnews.tv/tech'
    ]

You did not separate the URLs with commas in your code, so the list actually contains only two URLs: the first one, and the last three strings concatenated together and treated as a single URL. Put a comma after each URL, as shown above.
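
A minimal sketch of what goes wrong: Python implicitly concatenates adjacent string literals, so the missing commas silently merge the last three entries into one.

# Adjacent string literals are joined, so missing commas merge list items
urls = [
    'https://www.dawnnews.tv',
    'https://www.dawnnews.tv/latest-news'
    'https://www.dawnnews.tv/news'
    'https://www.dawnnews.tv/tech'
]
print(len(urls))   # 2, not 4
print(urls[1])     # https://www.dawnnews.tv/latest-newshttps://www.dawnnews.tv/newshttps://www.dawnnews.tv/tech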

Next, story_body = response.css('div.story__content') means there should be a div element with class=story__content on the page at the given URL, and I think that element is missing from the URLs you mentioned. A quick look at the HTML of those pages shows something like story__excerpt as the div class; I am not sure whether that is what you need. In any case, you need to inspect the HTML of these pages and pick the right selector for the content.
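
One quick way to check which selectors actually match is Scrapy's interactive shell; a sketch (the exact output depends on the site's current markup):

scrapy shell 'https://www.dawnnews.tv/tech'
>>> response.css('div.story__content')    # probably empty on this page
[]
>>> response.css('div.story__excerpt')    # non-empty if the class exists
[<Selector xpath=... data='<div class="story__excerpt ...'>, ...]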

What you can do to debug this is add print statements, print out story_body and story_paragraphs_text, and check whether you get the expected output. Hopefully this helps with the necessary debugging.

2017-10-23 22:11:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.dawnnews.tv> (referer: None)
https://www.dawnnews.tv
2017-10-23 22:11:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.dawnnews.tv/news> (referer: None)
https://www.dawnnews.tv/news
news.txt
[]
2017-10-23 22:11:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.dawnnews.tv/tech> (referer: None)
https://www.dawnnews.tv/tech
tech.txt
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' story__excerpt ')]" data='<div class="story__excerpt        ">فیس '>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' story__excerpt ')]" data='<div class="story__excerpt        ">یوٹی'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' story__excerpt ')]" data='<div class="story__excerpt        ">واٹس'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' story__excerpt ')]" data='<div class="story__excerpt        ">ویب '>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' story__excerpt ')]" data='<div class="story__excerpt        ">ابھی'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' story__excerpt ')]" data='<div class="story__excerpt        ">8 سا'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' story__excerpt ')]" data='<div class="story__excerpt        ">اسما'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' story__excerpt ')]" data='<div class="story__excerpt        ">دنیا'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' story__excerpt ')]" data='<div class="story__excerpt        ">فیس '>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' story__excerpt ')]" data='<div class="story__excerpt        ">سوشل'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' story__excerpt ')]" data='<div class="story__excerpt        "> فیس'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' story__excerpt ')]" data='<div class="story__excerpt        ">اگر '>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' story__excerpt ')]" data='<div class="story__excerpt        ">اس ف'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' story__excerpt ')]" data='<div class="story__excerpt        ">بہت '>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' story__excerpt ')]" data='<div class="story__excerpt        ">اب پ'>]
2017-10-23 22:11:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.dawnnews.tv/latest-news> (referer: None)
https://www.dawnnews.tv/latest-news
latest-news.txt
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' story__excerpt ')]" data='<div class="story__excerpt        ">فلم '>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' story__excerpt ')]" data='<div class="story__excerpt        ">فیس '>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' story__excerpt ')]" data='<div class="story__excerpt        ">چیئر'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' story__excerpt ')]" data='<div class="story__excerpt        ">غذا '>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' story__excerpt ')]" data='<div class="story__excerpt        ">جوڈی'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' story__excerpt ')]" data='<div class="story__excerpt        ">ہولی'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' story__excerpt ')]" data='<div class="story__excerpt        ">پاکس'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' story__excerpt ')]"       ">

The code used for the output above:

import os
import scrapy


newpath = 'urdu_data' 
if not os.path.exists(newpath):
    os.makedirs(newpath)


class UrduSpider(scrapy.Spider):
    name = "urdu"
    start_urls = [
        'https://www.dawnnews.tv',
        'https://www.dawnnews.tv/latest-news',
        'https://www.dawnnews.tv/news',
        'https://www.dawnnews.tv/tech'
    ]

    def should_process_page(self, page_url):
        # Only process pages under one of the start URLs,
        # not the start URLs themselves
        for s_url in self.start_urls:
            if page_url.startswith(s_url) and page_url != s_url:
                return True

        return False

    def parse(self, response):
        print(response.url)
        if self.should_process_page(response.url):
            page_id = response.url.split("/")[-1]
            filename = page_id + '.txt'
            print(filename)

            # if the response has a story body, save its contents
            story_body = response.css('div.story__excerpt')
            print(story_body)
            story_paragraphs_text = story_body.css('p::text')
            page_data = ''
            for p in story_paragraphs_text:
                page_data += p.extract() + '\n'

            if page_data:
                # use a context manager and explicit encoding so the Urdu
                # text is written correctly and the file is closed properly
                with open('urdu_data/' + filename, 'w', encoding='utf-8') as f:
                    f.write(page_data)

            # Now follow any links that are present on the page
            links = response.css('a.title-link ::attr(href)').extract()
            for link in links:
                yield scrapy.Request(
                    response.urljoin(link),
                    callback=self.parse
                )

You will need to make similar changes to pull content from other elements, depending on the HTML structure of the pages you crawl.
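
For example, a minimal sketch of a fallback selector, assuming the class names seen in the debug output above (story__content on article pages, story__excerpt on listing pages):

story_body = response.css('div.story__content')
story_paragraphs_text = story_body.css('p::text')
if not story_paragraphs_text:
    # listing pages appear to keep the text directly inside the excerpt div
    story_paragraphs_text = response.css('div.story__excerpt::text')

page_data = ''
for p in story_paragraphs_text:
    page_data += p.extract() + '\n'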