Question: How to get the content of an HTML element

I'm new to Go and I'm struggling to parse some HTML.

The HTML looks like this:

<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>
<body>

    <div>something</div>

    <div id="publication">
        <div>I want <span>this</span></div>
    </div>

    <div>
        <div>not this</div>
    </div>

</body>
</html>

I want to get this as a string:

<div>I want <span>this</span></div>

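I've tried html.NewTokenizer() (from golang.org/x/net/html), but I can't seem to get the full contents of an element from a token or a node. I also tried tracking the depth with it, but that picked up other markup as well.

I've also had a go with goquery, which looks perfect for this. The code: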

doc, err := goquery.NewDocument("{url}")
if err != nil {
    log.Fatal(err)
}

doc.Find("#publication").Each(func(i int, s *goquery.Selection) {
    fmt.Printf("Review %d: %s\n", i, s.Html())
})

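But s.Text() only prints the text, and s.Html() doesn't seem to exist (?).

I think parsing it as XML would work, except that the actual HTML is very deep and every parent element would need its own struct...

Any help would be great!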

Answer 1

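s.Html() does exist. You're not getting a result because it returns two values, a string and an error, so it can't be passed straight to fmt.Printf: you need to assign both return values and handle the error.

Add this to your code and it will work: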

doc.Find("#publication").Each(func(i int, s *goquery.Selection) {
    // Html returns the inner HTML of the first matched element, plus an error.
    innerHTML, err := s.Html()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("Review %d: %s\n", i, innerHTML)
})
