Question

My task is to find image URLs in HTML.

Problem

The HTML parsers golang.org/x/net/html and github.com/PuerkitoBio/goquery ignore the largest image on the page http://www.ozon.ru/context/detail/id/34498204/

Questions

What is wrong with my code?
Why is the img tag with src="" ignored?
Is it possible to get all images in the HTML with Go?

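Notes:

When I use a parser written in Swift, the image is found at //static2.ozone.ru/multimedia/spare_covers/1013531536.jpg
The image markup is found when I search the page source with a regular expression.
The third-party service saveallimages.com also finds this image.
I tried gokogiri but could not compile it on my Mac: go get succeeds, but go build hangs forever.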
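
For illustration, a regular-expression search of the kind mentioned above could look like the sketch below. The pattern and the function name findImgSrcs are assumptions made for this example, not the exact search that was run, and a regex is only a rough cross-check on the raw page text rather than a real HTML parser.

```go
package main

import (
	"fmt"
	"regexp"
)

// imgSrc captures the value of a quoted src attribute inside an <img> tag.
// Requiring whitespace before "src" keeps attributes such as data-src from
// matching. Illustrative pattern only.
var imgSrc = regexp.MustCompile(`(?i)<img\b[^>]*?\ssrc\s*=\s*["']([^"']*)["']`)

// findImgSrcs returns every src value found in the raw page text,
// regardless of which element the <img> tag is nested in.
func findImgSrcs(page string) []string {
	var srcs []string
	for _, m := range imgSrc.FindAllStringSubmatch(page, -1) {
		srcs = append(srcs, m[1])
	}
	return srcs
}

func main() {
	page := `<noscript><img class="big" src="//example.com/big.jpg"></noscript>`
	for _, src := range findImgSrcs(page) {
		fmt.Println(src)
	}
}
```

Because this works on the raw text, it also reports an img tag that sits inside another element's unparsed content.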

Source of the parsed HTML page

This is the HTML returned by resp, _ := http.Get(url)

Code:

package main

import (
  "golang.org/x/net/html"
  "log"
  "net/http"
)


func main() {

  url := "http://www.ozon.ru/context/detail/id/34498204/"

  if resp, err := http.Get(url); err == nil {
    defer resp.Body.Close()

    log.Println("Load page complete")

    if resp != nil {
      log.Println("Page response is NOT nil")

      if document, err := html.Parse(resp.Body); err == nil {

        var parser func(*html.Node)
        parser = func(n *html.Node) {
          if n.Type == html.ElementNode && n.Data == "img" {

            var imgSrcUrl, imgDataOriginal string

            for _, element := range n.Attr {
              if element.Key == "src" {
                imgSrcUrl = element.Val
              }
              if element.Key == "data-original" {
                imgDataOriginal = element.Val
              }
            }

            log.Println(imgSrcUrl, imgDataOriginal)
          }

          for c := n.FirstChild; c != nil; c = c.NextSibling {
            parser(c)
          }

        }
        parser(document)
      } else {
        log.Panicln("Parse html error", err)
      }

    } else {
      log.Println("Page response IS nil")
    }
  }

}

Answer 1

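This is not a bug but the expected behaviour of x/net/html, and it affects every parser built on x/net/html, goquery included. The parser follows the HTML5 parsing rules with scripting enabled, so everything inside <noscript> is kept as plain text instead of being parsed into elements. An <img> placed inside <noscript>, as on this page, therefore never shows up as an element node.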

There are four possible solutions:

Remove <noscript> and </noscript> from the HTML so that x/net/html parses their content as regular markup. Something like:

package main

import (
    "golang.org/x/net/html"
    "log"
    "net/http"
    "io/ioutil"
    "strings"
)

func main() {

    url := "http://www.ozon.ru/context/detail/id/34498204/"

    if resp, err := http.Get(url); err == nil {
        defer resp.Body.Close()

        log.Println("Load page complete")

        if resp != nil {
            log.Println("Page response is NOT nil")
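            // Read the whole body so the <noscript> tags can be stripped
            // before parsing; the deferred Close above releases the body.
            data, err := ioutil.ReadAll(resp.Body)
            if err != nil {
                log.Panicln("Read body error", err)
            }

            // With the tags removed, the parser sees the former <noscript>
            // content as ordinary markup instead of raw text.
            hdata := strings.Replace(string(data), "<noscript>", "", -1)
            hdata = strings.Replace(hdata, "</noscript>", "", -1)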

            if document, err := html.Parse(strings.NewReader(hdata)); err == nil {
                var parser func(*html.Node)
                parser = func(n *html.Node) {
                    if n.Type == html.ElementNode && n.Data == "img" {

                        var imgSrcUrl, imgDataOriginal string

                        for _, element := range n.Attr {
                            if element.Key == "src" {
                                imgSrcUrl = element.Val
                            }
                            if element.Key == "data-original" {
                                imgDataOriginal = element.Val
                            }
                        }

                        log.Println(imgSrcUrl, imgDataOriginal)
                    }

                    for c := n.FirstChild; c != nil; c = c.NextSibling {
                        parser(c)
                    }

                }
                parser(document)
            } else {
                log.Panicln("Parse html error", err)
            }

        } else {
            log.Println("Page response IS nil")
        }
    }

}

Patch x/net/html with https://github.com/bearburger/net/commit/42ac75393ced8c48137b574278522df1f3fa2cec
Use gokogiri with Go 1.4 (I'm pretty sure this is the last version it supports).
Wait for a decision on https://github.com/golang/go/issues/16318. If this turns out to be a real bug, I'll make the pull request.
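
The first workaround can also be factored into a small helper that does not depend on the exact spelling of the tags. The following is a minimal sketch and not part of the code above: the name stripNoscriptTags and the regular expression are illustrative assumptions. It removes only the tags themselves, so the enclosed markup reaches the parser as ordinary HTML.

```go
package main

import (
	"fmt"
	"regexp"
)

// noscriptTag matches opening and closing noscript tags in any letter case,
// with optional whitespace or attributes before the closing '>'.
// Hypothetical helper pattern, not taken from x/net/html.
var noscriptTag = regexp.MustCompile(`(?i)</?noscript(\s[^>]*)?>`)

// stripNoscriptTags removes the <noscript> and </noscript> tags but keeps
// everything between them.
func stripNoscriptTags(page string) string {
	return noscriptTag.ReplaceAllString(page, "")
}

func main() {
	page := `<div><NoScript><img src="//example.com/big.jpg"></noscript></div>`
	fmt.Println(stripNoscriptTags(page))
}
```

The returned string can then be handed to html.Parse through strings.NewReader, in the same way as hdata in the first solution.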
