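How to parse an HTTP GET response in Golang

Date: 2016-04-10 19:37:07

Tags: go html-parsing html-escape-characters

I am getting the following kind of response from a URL I am requesting, and I need to parse it to get the HTML I want.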

  

this=ajax({"htmlInfo":"SOME-HTML","otherInfo":"Blah Blah","moreInfo":"Bleh Bleh"})

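As shown above, there are three key-value pairs, and I need to get "SOME-HTML" out of them. How can I do that? The main problem is that "SOME-HTML" contains escape characters. Below is the kind of response I receive.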

  

\u003Cdiv class=\u0022container columns-2\u0022\u003E\n\n \u003Csection class=\u0022col-main\u0022\u003E\n \r\n\u003Cdiv class=\u0027visor-article-list list list-view-recent\u0027 \u003E\r\n\u003Cdiv class=\u0027grid_item visor-article-teaser list_default\u0027 \u003E\n \u003Ca class=\u0027grid_img\u0027 href=\u0027/manUnited-is-the-best\u0027\u003E\n \u003Cimg src=\u0022http://www.xyz.com/sites//files/styles/w400h22

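Can anyone help me with this? I don't know how to approach it.

Thanks in advance.

1 answer:

Answer 0 (score: 1)

The simplest approach is to extract the JSON and then unmarshal it into a struct. The \uXXXX sequences are Unicode escapes, which json.Unmarshal decodes into the corresponding characters.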

package main

import (
    "encoding/json"
    "fmt"
    "regexp"
)

// Data follows the structure of the JSON data in the response
type Data struct {
    HTMLInfo  string `json:"htmlInfo"`
    OtherInfo string `json:"otherInfo"`
    MoreInfo  string `json:"moreInfo"`
}

func main() {
    // input is an example of the raw response data. It's probably a []byte if
    // you got it from ioutil.ReadAll(resp.Body) (io.ReadAll in Go 1.16+)
    input := []byte(`this=ajax({"htmlInfo":"\u003Cdiv class=\u0022container columns-2\u0022\u003E\n\n \u003Csection class=\u0022col-main\u0022\u003E\n \r\n\u003Cdiv class=\u0027visor-article-list list list-view-recent\u0027 \u003E\r\n\u003Cdiv class=\u0027grid_item visor-article-teaser list_default\u0027 \u003E\n \u003Ca class=\u0027grid_img\u0027 href=\u0027/manUnited-is-the-best\u0027\u003E\n \u003Cimg src=\u0022http://example.com/sites//files/styles/w400h22", "otherInfo": "Blah Blah", "moreInfo": "Bleh Bleh"})`)

    // First we want to extract the data json using regex with a capture group.
    dataRegex, err := regexp.Compile("ajax\\((.*)\\)")
    if err != nil {
        fmt.Println("regex failed to compile:", err)
        return
    }

    // FindSubmatch should return two matches:
    // 0: The full match
    // 1: The contents of the capture group (what we want)
    matches := dataRegex.FindSubmatch(input)
    if len(matches) != 2 {
        fmt.Println("incorrect number of match results:", len(matches))
        return
    }
    dataJSON := matches[1]

    // Since the data is in JSON format, we can unmarshal it into a struct.  If
    // you don't care at all about the fields other than "htmlInfo", you can
    // omit them from the struct.
    data := &Data{}
    if err := json.Unmarshal(dataJSON, data); err != nil {
        fmt.Println("failed to unmarshal data json:", err)
        return
    }

    // You now have access to the "htmlInfo" property
    fmt.Println("HTML INFO:", data.HTMLInfo)
}

This produces:

HTML INFO: <div class="container columns-2">

 <section class="col-main">

<div class='visor-article-list list list-view-recent' >
<div class='grid_item visor-article-teaser list_default' >
 <a class='grid_img' href='/manUnited-is-the-best'>
 <img src="http://example.com/sites//files/styles/w400h22
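
Since the question starts from an HTTP GET, here is a minimal end-to-end sketch of the same idea. It is only one possible arrangement, not the answer's code: the helper names extractHTML and fetchHTML are hypothetical, the httptest server and its short payload are stand-ins for the real URL, and the wrapper is stripped by slicing between the first "(" and the last ")" rather than with a regular expression. It also assumes Go 1.16 or later for io.ReadAll.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// extractHTML (hypothetical helper) pulls the "htmlInfo" string out of a body
// shaped like this=ajax({...}). It slices between the first '(' and the last
// ')' instead of using a regular expression.
func extractHTML(body []byte) (string, error) {
	start := bytes.IndexByte(body, '(')
	end := bytes.LastIndexByte(body, ')')
	if start < 0 || end <= start {
		return "", errors.New("no ajax(...) wrapper found")
	}

	// Only the field we need is declared; other keys are ignored.
	var data struct {
		HTMLInfo string `json:"htmlInfo"`
	}
	// json.Unmarshal decodes the \uXXXX escapes while parsing the string.
	if err := json.Unmarshal(body[start+1:end], &data); err != nil {
		return "", fmt.Errorf("unmarshal payload: %w", err)
	}
	return data.HTMLInfo, nil
}

// fetchHTML (hypothetical helper) performs the GET, reads the whole body,
// and hands it to extractHTML.
func fetchHTML(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return extractHTML(body)
}

func main() {
	// Stand-in server for the demo; the real URL would go to fetchHTML instead.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, `this=ajax({"htmlInfo":"\u003Cp class=\u0022x\u0022\u003Ehi\u003C/p\u003E","otherInfo":"Blah Blah"})`)
	}))
	defer srv.Close()

	html, err := fetchHTML(srv.URL)
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println(html) // prints: <p class="x">hi</p>
}
```

Slicing on the parentheses works here because the payload is the only thing inside the wrapper. If the real response carries anything else after the closing parenthesis, the regular expression from the answer or a stricter check may be the safer choice.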