How do I scrape the content of this website correctly?

Asked: 2021-02-20 18:35:23

Tags: html flutter dart web-scraping

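I'm new to web development and can't work out how to retrieve content from 4 URLs on the same website; I always get empty results. I'm using Flutter and the package web_scraper: ^0.0.8.

I need to retrieve the title, image, description, and URL from the site. The pages I will navigate are: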

https://datassette.org/revistas (categories)

https://datassette.org/revistas/videogames (choose the magazine language)

https://datassette.org/revistas/br-brasil (magazines: title, image, and URL)

https://datassette.org/revistas/acao-games/semana-em-acao-especial-games-no-1 (magazine title, description, image, and URL of the PDF).

For reference, this is the .getElement method from the package:

  /// Returns List of elements found at specified address.
  /// Example address: "div.item > a.title" where item and title are class names of div and a tag respectively.
  List<Map<String, dynamic>> getElement(String address, List<String> attribs) {
    // Attribs are the list of attributes required to extract from the html tag(s) ex. ['href', 'title'].
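
The excerpt stops before the method body. As far as I can tell from the package, each returned entry holds the element's text under 'title' and the requested attributes under 'attributes'. The sketch below only illustrates that shape with made-up values; sample and attributeOf are hypothetical names, not part of web_scraper.

```dart
// Hypothetical sample shaped like a getElement result; the values are made up.
final List<Map<String, dynamic>> sample = [
  {
    'title': 'Videogames',
    'attributes': {'href': '/revistas/videogames', 'title': null},
  },
];

/// Reads one attribute from a getElement-style entry, or null if it is absent.
String? attributeOf(Map<String, dynamic> entry, String name) {
  final attributes = entry['attributes'] as Map<String, dynamic>?;
  return attributes?[name] as String?;
}

void main() {
  for (final entry in sample) {
    // The element text lives under 'title'; attributes are nested one level down.
    print('${entry['title']} -> ${attributeOf(entry, 'href')}');
  }
}
```

An attribute the tag does not carry comes back as null rather than being left out, so a null check is needed before using it.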



import 'package:web_scraper/web_scraper.dart';

class WebScraperHelper {

  static final webScraper = WebScraper('https://datassette.org');

  static Future<void> getData() async{

    if (await webScraper.loadWebPage('/revistas')) {

      // This prints the full HTML
      //print("getPageContent: ${webScraper.getPageContent()}");

      List<Map<String, dynamic>> images = webScraper.getElement(
          'img.width-full.wt-height-full.display-block.position-absolute',
          ['src']);

      List<Map<String, dynamic>> descriptions = webScraper.getElement(
          'h3.text-gray.text-truncate.mb-xs-0.text-body', ['title']);

      List<Map<String, dynamic>> urls = webScraper.getElement(
          'div > ul > li > div > a',
          ['href', 'title']);

      print("images: $images"); // prints []
      print("descriptions: $descriptions"); // prints []
      print("urls: $urls"); // prints []

    }

  }

}

2 Answers:

Answer 0 (score: 0)

I don't use that package; I'm more used to regular expressions. Here is an example:

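import 'dart:async';

// NetService is a helper from a local package; getRaw is assumed to return
// the response body as a String. Any HTTP client that does so works here.
import 'package:_samples2/networking.dart';


// Get the categories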
const kUrlRevistas = 'https://datassette.org/revistas';

var regExp1 = RegExp(r'<a href="\/revistas\/\p{L}+">(\p{L}+)<\/a>', unicode: true);

class Revistas {
  static Future fetchRevistas() async {
    print('Start fetching...');
    return await NetService.getRaw(kUrlRevistas)
      .whenComplete(() => print('Fetching done!'));
  }
}

void main(List<String> args) async {
  var data = await Revistas.fetchRevistas();
  var matches = regExp1.allMatches(data);
  
  print(matches.map((e) => e.group(1)).toList());
}

Result:

Start fetching...
Fetching done!
[Diversas, Eletrônica, Informática, Videogames]

P.S.: You need to read the page's HTML source to write a pattern that matches it.
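
The same idea can be tried offline before hitting the network. This is a minimal sketch, assuming the category links look like the fragment in sampleHtml; that markup, parseCategories, and categoryLink are made up for illustration, and the real page may differ. It uses the pattern from above with a second capture group so the link path is kept as well.

```dart
// Hypothetical HTML fragment standing in for the category list; the real
// markup on the site may differ.
const sampleHtml = '''
<ul>
  <li><a href="/revistas/diversas">Diversas</a></li>
  <li><a href="/revistas/eletronica">Eletrônica</a></li>
</ul>
''';

// Group 1 captures the link path, group 2 the visible category name.
// \p{L} only works because unicode: true is set.
final categoryLink =
    RegExp(r'<a href="(/revistas/\p{L}+)">(\p{L}+)</a>', unicode: true);

/// Returns one {name, path} map per category link found in [html].
List<Map<String, String>> parseCategories(String html) {
  return [
    for (final match in categoryLink.allMatches(html))
      {'name': match.group(2)!, 'path': match.group(1)!},
  ];
}

void main() {
  for (final category in parseCategories(sampleHtml)) {
    print('${category['name']}: ${category['path']}');
  }
}
```

Note that \p{L}+ matches letters only, so a path containing a hyphen, such as /revistas/br-brasil, would need a wider character class.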

Answer 1 (score: 0)

After hours of testing, I found a way to retrieve all the data I need.

static Future<void> getMagazines() async {

    if (await webScraper.loadWebPage('/revistas/acao-games')) {

      List<Map<String, dynamic>> maps = [];

      List<Map<String, dynamic>> titles = webScraper.getElement(
          'span.field-content > a',
          []
      );

      // add only the maps that have data in the title attribute

      List<Map<String, dynamic>> urls = webScraper.getElement(
          'div.field-content > a',
          [ 'href']
      );

      List<Map<String, dynamic>> images = webScraper.getElement(
          'div.field-content > a > img',
          [ 'src']
      );

      for (int i = 0; i < urls.length; i++) {

        maps.add({
          "title": titles[i]["title"],
          "url" : urls[i]["attributes"]["href"],
          "image" : images[i]["attributes"]["src"],
        });

      }

      //print("TITLES: $titles");
      //print("URLS: $urls");
      //print("IMAGES: $images");
      print(maps.length);


    }
}
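
One caveat with the loop above: it indexes titles and images by urls.length, so it throws a RangeError if the three selectors match different numbers of elements. A more defensive merge might look like the sketch below. It assumes the lists have the shape getElement returns (text under 'title', attributes under 'attributes'); combine and the sample values in main are hypothetical, and the image path is invented.

```dart
import 'dart:math';

/// Merges getElement-style lists into one map per magazine.
List<Map<String, String>> combine(
  List<Map<String, dynamic>> titles,
  List<Map<String, dynamic>> urls,
  List<Map<String, dynamic>> images, {
  String baseUrl = 'https://datassette.org',
}) {
  final base = Uri.parse(baseUrl);
  // Stop at the shortest list so a missing element cannot cause a RangeError.
  final count = [titles.length, urls.length, images.length].reduce(min);
  final result = <Map<String, String>>[];
  for (var i = 0; i < count; i++) {
    final title = ((titles[i]['title'] as String?) ?? '').trim();
    final href = urls[i]['attributes']?['href'] as String?;
    final src = images[i]['attributes']?['src'] as String?;
    // Keep only entries that actually carry a title, a link, and an image.
    if (title.isEmpty || href == null || src == null) continue;
    result.add({
      'title': title,
      // Turn site-relative paths into absolute URLs.
      'url': base.resolve(href).toString(),
      'image': base.resolve(src).toString(),
    });
  }
  return result;
}

void main() {
  // Made-up sample data; the image path is invented for illustration.
  final magazines = combine(
    [
      {'title': ' Semana em Ação ', 'attributes': <String, dynamic>{}},
    ],
    [
      {
        'title': '',
        'attributes': {
          'href': '/revistas/acao-games/semana-em-acao-especial-games-no-1',
        },
      },
    ],
    [
      {'title': '', 'attributes': {'src': '/capa.jpg'}},
    ],
  );
  print(magazines);
}
```

Pairing by index still assumes the three selectors return elements in the same order; if one magazine lacks an image, later entries can be misaligned, so selecting each magazine's container first and reading its children would be more robust.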