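My task is to find image URLs in HTML.
Problem
The HTML parsers golang.org/x/net/html
and
github.com/PuerkitoBio/goquery
ignore an image on the page http://www.ozon.ru/context/detail/id/34498204/
Question
Why is the img tag with src="//static2.ozone.ru/multimedia/spare_covers/1013531536.jpg" ignored?
Remarks:
When I use a parser written in Swift, this image tag is found.
When I search with a regular expression, this image tag is found.
When I use the third-party service saveallimages.com, this image tag is found.
I tried to use gokogiri but could not compile it on my Mac: go get succeeds, but go build hangs forever.
Source of the parsed HTML page
This is the HTML as returned by resp, _ := http.Get(url)
Code: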
package main

import (
	"golang.org/x/net/html"
	"log"
	"net/http"
)

func main() {
	url := "http://www.ozon.ru/context/detail/id/34498204/"
	if resp, err := http.Get(url); err == nil {
		defer resp.Body.Close()
		log.Println("Load page complete")
		if resp != nil {
			log.Println("Page response is NOT nil")
			if document, err := html.Parse(resp.Body); err == nil {
				var parser func(*html.Node)
				parser = func(n *html.Node) {
					if n.Type == html.ElementNode && n.Data == "img" {
						var imgSrcUrl, imgDataOriginal string
						for _, element := range n.Attr {
							if element.Key == "src" {
								imgSrcUrl = element.Val
							}
							if element.Key == "data-original" {
								imgDataOriginal = element.Val
							}
						}
						log.Println(imgSrcUrl, imgDataOriginal)
					}
					for c := n.FirstChild; c != nil; c = c.NextSibling {
						parser(c)
					}
				}
				parser(document)
			} else {
				log.Panicln("Parse html error", err)
			}
		} else {
			log.Println("Page response IS nil")
		}
	}
}
Answer 0 (score: 2)
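This is not a bug but the expected behaviour of x/net/html, and it affects every parser based on x/net/html (goquery included). The img tag you are looking for sits inside a <noscript> element, and x/net/html does not turn the content of <noscript> into element nodes, so a walk over the tree never sees that img.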
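To confirm that a missing image really is wrapped in <noscript>, the raw HTML can be checked without any HTML parser. The following is only a rough diagnostic sketch using the standard library: the helper name imagesInNoscript and the sample markup are made up for illustration, and regular expressions are not a robust way to handle HTML in general.

```go
package main

import (
	"fmt"
	"regexp"
)

// noscriptRe captures the raw content of each <noscript>...</noscript> block.
// (?is) makes the match case-insensitive and lets "." span newlines.
var noscriptRe = regexp.MustCompile(`(?is)<noscript[^>]*>(.*?)</noscript>`)

// imgSrcRe captures the quoted src value of an <img> tag. Requiring
// whitespace before "src" avoids matching attributes such as data-src.
var imgSrcRe = regexp.MustCompile(`(?is)<img\b[^>]*?\ssrc\s*=\s*["']([^"']*)["']`)

// imagesInNoscript (hypothetical helper) returns the src values of all
// <img> tags found inside <noscript> blocks of the raw HTML.
func imagesInNoscript(rawHTML string) []string {
	var srcs []string
	for _, block := range noscriptRe.FindAllStringSubmatch(rawHTML, -1) {
		for _, img := range imgSrcRe.FindAllStringSubmatch(block[1], -1) {
			srcs = append(srcs, img[1])
		}
	}
	return srcs
}

func main() {
	// Sample markup invented for this sketch, not taken from the real page.
	page := `<div><img src="/visible.png">
<noscript><img class="cover" src="//example.com/hidden.jpg"></noscript></div>`
	fmt.Println(imagesInNoscript(page)) // prints [//example.com/hidden.jpg]
}
```

If the image URL shows up in this list but not in the output of the tree walk, the <noscript> wrapper is the likely cause.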
There are four possible solutions:
Remove <noscript> and </noscript> from the HTML so that x/net/html parses their content as expected. Something like:
package main

import (
	"io"
	"log"
	"net/http"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	url := "http://www.ozon.ru/context/detail/id/34498204/"
	if resp, err := http.Get(url); err == nil {
		defer resp.Body.Close() // the deferred call is the only Close needed
		log.Println("Load page complete")
		if resp != nil {
			log.Println("Page response is NOT nil")
			// --------------
			// Read the whole body and drop the <noscript> tags (not their
			// content), so the wrapped markup is parsed as ordinary elements.
			// io.ReadAll needs Go 1.16+; older toolchains use ioutil.ReadAll.
			data, err := io.ReadAll(resp.Body)
			if err != nil {
				log.Panicln("Read body error", err)
			}
			hdata := strings.Replace(string(data), "<noscript>", "", -1)
			hdata = strings.Replace(hdata, "</noscript>", "", -1)
			// --------------
			if document, err := html.Parse(strings.NewReader(hdata)); err == nil {
				var parser func(*html.Node)
				parser = func(n *html.Node) {
					if n.Type == html.ElementNode && n.Data == "img" {
						var imgSrcUrl, imgDataOriginal string
						for _, element := range n.Attr {
							if element.Key == "src" {
								imgSrcUrl = element.Val
							}
							if element.Key == "data-original" {
								imgDataOriginal = element.Val
							}
						}
						log.Println(imgSrcUrl, imgDataOriginal)
					}
					for c := n.FirstChild; c != nil; c = c.NextSibling {
						parser(c)
					}
				}
				parser(document)
			} else {
				log.Panicln("Parse html error", err)
			}
		} else {
			log.Println("Page response IS nil")
		}
	}
}
Patch x/net/html with https://github.com/bearburger/net/commit/42ac75393ced8c48137b574278522df1f3fa2cec
Use gokogiri with Go 1.4 (I'm pretty sure this is the last supported version).
Wait for a decision on https://github.com/golang/go/issues/16318. If this is a real bug, I'll make the pull request.
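A note on the first solution: the exact-string replacement only removes tags written precisely as <noscript> and </noscript>. Below is a small sketch of a more tolerant variant that also handles upper-case tags and tags with attributes. The helper name unwrapNoscript is invented for this example, and it assumes that keeping the fallback markup is what you want (it may duplicate images that also appear outside <noscript>).

```go
package main

import (
	"fmt"
	"regexp"
)

// noscriptTagRe matches opening and closing noscript tags in any letter
// case, with optional attributes: <noscript>, <NOSCRIPT class="x">, </noscript>.
var noscriptTagRe = regexp.MustCompile(`(?i)</?noscript\b[^>]*>`)

// unwrapNoscript (hypothetical helper) removes only the noscript tags
// themselves and keeps the markup they wrapped.
func unwrapNoscript(rawHTML string) string {
	return noscriptTagRe.ReplaceAllString(rawHTML, "")
}

func main() {
	in := `<NoScript class="fallback"><img src="cover.jpg"></NOSCRIPT>`
	fmt.Println(unwrapNoscript(in)) // prints <img src="cover.jpg">
}
```

The returned string can then be handed to html.Parse(strings.NewReader(...)) in the same way as hdata in the code above.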