Scrapy spider does not visit the new URLs

Date: 2021-04-18 22:58:17

Tags: python scrapy sitemap

I have a spider that walks through all the sitemaps and, for each article URL, appends &com=1 to the end of the URL and requests it to get the title and the comments. But for some reason the request does not go through, or the XPath finds nothing. I know that if we already have a Request we need to use replace(), but does that still apply when we loop over plain URL strings? If so, how do we do the replacement when that method is no longer available?
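For reference, article + '&com=1' only produces a valid URL when the sitemap entry already carries a query string (as the test URL below does with ?id=112); if a sitemap <loc> has no ?, the result is broken. A safer way to build the comment URL is w3lib, which ships with Scrapy and picks ? or & as needed. A minimal sketch, assuming the article URLs look like the test URL below:

from w3lib.url import add_or_replace_parameter

article = 'https://www.delfi.lt/news/daily/hot/apsinuoginusios-moterys-sutrikde-madu-sou.d?id=112'

# Appends ?com=1 or &com=1 depending on whether the URL
# already carries a query string.
comment_url = add_or_replace_parameter(article, 'com', '1')
print(comment_url)
# https://www.delfi.lt/news/daily/hot/apsinuoginusios-moterys-sutrikde-madu-sou.d?id=112&com=1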

Test URL: https://www.delfi.lt/news/daily/hot/apsinuoginusios-moterys-sutrikde-madu-sou.d?id=112&com=1

The XPath works in the developer console on the site.
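Worth noting: the browser's developer console evaluates XPath against the rendered DOM, which can differ from the raw HTML that Scrapy downloads (comments in particular are often injected by JavaScript). A quick way to check what Scrapy actually receives is scrapy shell; this session is a sketch, not output captured from the real page:

scrapy shell "https://www.delfi.lt/news/daily/hot/apsinuoginusios-moterys-sutrikde-madu-sou.d?id=112&com=1"
>>> response.xpath('//*[@class="article-title"]//text()').extract()
>>> response.xpath('//div[@class="delfi-source-name"]/text()').extract()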

Code:

import scrapy
from urllib.parse import urlencode


class MySpider(scrapy.Spider):

    name = "delfi"
    root = 'http://www.delfi.lt'
    start_urls = ['https://www.delfi.lt/sitemap.xml']
    custom_settings = {
        'USER_AGENT': "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0",
        'ITEM_PIPELINES': {'__main__.ArticlesPipeline': 300},
        'HTTPCACHE_ENABLED': True,
        'HTTPCACHE_EXPIRATION_SECS': 0,
        'DOWNLOAD_DELAY': 0.1,
        'LOG_LEVEL': 'INFO',
        'LOG_FILE': 'delfi_scraping_logs.log',
        'COOKIES_ENABLED': False,
    }

    def try_to_do(self, func, arg):
        try:
            return func(arg)
        except Exception:
            self.logger.exception('trying to do: ')

    def parse(self, response):
        response.selector.remove_namespaces()
        sitemaps = response.xpath('//loc/text()').extract()

        for sitemap in sitemaps:
            yield scrapy.Request(sitemap, callback=self.parse_sitemap)


    def parse_sitemap(self, response):
        response.selector.remove_namespaces()
        articles = response.xpath('//loc/text()').extract()
        self.logger.info(f'starting articles from sitemap {response.url}')
        for article in articles:
            if "delfi.lt/video" in article or "delfi.lt/apps" in article or \
                    "delfi.lt/temos/" in article or 'delfi.lt/images/' in article: #or \
                    #article in self.existing_urls
                self.logger.info(f'skipping {article}')
                continue
            
            ### This does not work

            # new_article = article + '&com=1' #direct change did not help
            payload = {'com' : 1}

            ### This does not work

            yield scrapy.Request(article + "&" + urlencode(payload), callback=self.parse_article)
            

    def manage_sequence_of_strings(self, seq):
        return ' '.join([s.strip() for s in seq]).replace('\xa0', ' ').replace('\n', ' ').replace('  ', ' ').strip()

    def get_fields(self, response):
        kwargs = {k: self.try_to_do(v, response)
                  for k, v in [('author', self.xauthor),
                               ('title', self.xtitle)]}
        return kwargs

    def xauthor(self, resp):
        author = resp.xpath('//div[@class="delfi-source-name"]/text()').extract()
        if len(author) > 0:
            return author[0]


    def xtitle(self, resp):
        return self.manage_sequence_of_strings(
            resp.xpath('//*[@class="article-title"]//text()').extract())

    def parse_article(self, response):
        url = response.url
        print('URL = ' + url)
        
        kwargs = self.get_fields(response)
        
        yield {'url': url, 'website':'delfi', 'category': None, 'success': all(kwargs.values()), **kwargs}
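One cause that would make the extra requests vanish silently is Scrapy's duplicate filter: several sitemaps can list the same article, and dupefilter messages are logged at DEBUG, which the LOG_LEVEL: 'INFO' setting above hides. This is an assumption, not a confirmed diagnosis; a quick way to rule it out is a debugging variant of the request yielded in parse_sitemap:

            # Debugging sketch: dont_filter=True bypasses the duplicate
            # filter, so the request is downloaded even if the same URL
            # was already yielded from another sitemap.
            yield scrapy.Request(article + "&" + urlencode(payload),
                                 callback=self.parse_article,
                                 dont_filter=True)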

0 Answers:

No answers