HTML parser ignores img tag (Golang)

Time: 2016-07-10 15:47:28

Tags: go

My task is to find image URLs in HTML.

Problem

The HTML parsers golang.org/x/net/html and github.com/PuerkitoBio/goquery ignore the largest image on the page http://www.ozon.ru/context/detail/id/34498204/

Questions

  • What is wrong with my code?
  • Why is the img tag with src="" ignored?
  • Is it possible to get all images in the HTML using Go?

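Notes:

  • When I use a parser written in Swift, the image //static2.ozone.ru/multimedia/spare_covers/1013531536.jpg is found on the page.
  • The markup for this image is found when searching with regular expressions.
  • The markup for this image is found when using the third-party service saveallimages.com.
  • I tried gokogiri, but could not compile it on my Mac. go get succeeds, but go build hangs forever.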

Source of the parsed HTML page

The HTML being parsed is the result of resp, _ := http.Get(url)

Code:

package main

import (
  "golang.org/x/net/html"
  "log"
  "net/http"
)


func main() {

  url := "http://www.ozon.ru/context/detail/id/34498204/"

  if resp, err := http.Get(url); err == nil {
    defer resp.Body.Close()

    log.Println("Load page complete")

    if resp != nil {
      log.Println("Page response is NOT nil")

      if document, err := html.Parse(resp.Body); err == nil {

        var parser func(*html.Node)
        parser = func(n *html.Node) {
          if n.Type == html.ElementNode && n.Data == "img" {

            var imgSrcUrl, imgDataOriginal string

            for _, element := range n.Attr {
              if element.Key == "src" {
                imgSrcUrl = element.Val
              }
              if element.Key == "data-original" {
                imgDataOriginal = element.Val
              }
            }

            log.Println(imgSrcUrl, imgDataOriginal)
          }

          for c := n.FirstChild; c != nil; c = c.NextSibling {
            parser(c)
          }

        }
        parser(document)
      } else {
        log.Panicln("Parse html error", err)
      }

    } else {
      log.Println("Page response IS nil")
    }
  }

}

1 answer:

Answer 0 (score: 2)

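This is not a bug but the expected behaviour of x/net/html, and it affects every parser built on x/net/html, goquery included. The missing image is wrapped in a <noscript> element, and x/net/html keeps the content of <noscript> as plain text instead of parsing it into nodes, so the <img> inside it never shows up as an element in the tree.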

There are four possible solutions:

  1. Remove the <noscript> and </noscript> tags from the HTML so that x/net/html parses their content as ordinary markup. Something like:

    package main
    
    import (
        "io/ioutil"
        "log"
        "net/http"
        "strings"
    
        "golang.org/x/net/html"
    )
    
    func main() {
        url := "http://www.ozon.ru/context/detail/id/34498204/"
    
        resp, err := http.Get(url)
        if err != nil {
            log.Fatalln("Load page error", err)
        }
        defer resp.Body.Close()
        log.Println("Load page complete")
    
        // Read the whole body so the noscript tags can be stripped before parsing.
        data, err := ioutil.ReadAll(resp.Body)
        if err != nil {
            log.Panicln("Read body error", err)
        }
    
        // Drop the <noscript> wrappers (exact lowercase match only); their content stays.
        hdata := strings.Replace(string(data), "<noscript>", "", -1)
        hdata = strings.Replace(hdata, "</noscript>", "", -1)
    
        document, err := html.Parse(strings.NewReader(hdata))
        if err != nil {
            log.Panicln("Parse html error", err)
        }
    
        // Walk the tree and log src and data-original of every img element.
        var parser func(*html.Node)
        parser = func(n *html.Node) {
            if n.Type == html.ElementNode && n.Data == "img" {
                var imgSrcUrl, imgDataOriginal string
                for _, element := range n.Attr {
                    if element.Key == "src" {
                        imgSrcUrl = element.Val
                    }
                    if element.Key == "data-original" {
                        imgDataOriginal = element.Val
                    }
                }
                log.Println(imgSrcUrl, imgDataOriginal)
            }
            for c := n.FirstChild; c != nil; c = c.NextSibling {
                parser(c)
            }
        }
        parser(document)
    }
    
  2. Patch x/net/html with https://github.com/bearburger/net/commit/42ac75393ced8c48137b574278522df1f3fa2cec

  3. Use gokogiri with Go 1.4 (I'm pretty sure this is the last version it supports)

  4. Wait for a decision on https://github.com/golang/go/issues/16318. If this is a real bug, I'll make a pull request.
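
One caveat on solution 1: strings.Replace only matches the exact lowercase forms <noscript> and </noscript>, so a page that writes the tag in another case or with attributes would slip through. A slightly more tolerant variant can be sketched with the standard regexp package. The helper name stripNoscriptTags and the sample markup are assumptions made for illustration, and this is still text surgery rather than real HTML handling.

```go
package main

import (
	"fmt"
	"regexp"
)

// noscriptTag matches opening and closing noscript tags in any letter case,
// with optional attributes or whitespace before the closing '>'.
var noscriptTag = regexp.MustCompile(`(?i)</?noscript\b[^>]*>`)

// stripNoscriptTags (hypothetical helper) removes the tags themselves and
// keeps whatever was between them, so a parser sees it as ordinary markup.
func stripNoscriptTags(page string) string {
	return noscriptTag.ReplaceAllString(page, "")
}

func main() {
	// Placeholder markup standing in for the downloaded page.
	page := `<div><NOSCRIPT><img src="//example.invalid/cover.jpg"></NoScript></div>`
	fmt.Println(stripNoscriptTags(page))
}
```

The result of stripNoscriptTags can be passed to html.Parse through strings.NewReader, in place of the two strings.Replace calls in solution 1.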