How do I extract only the text from HTML in Golang?

Time: 2017-06-08 17:00:10

Tags: html go text

To extract text from HTML, I use the fully HTML5-compliant tokenizer and parser, like this:

    s := `
<p>Links:</p><ul><li><a href="foo">Foo</a><li>
<a href="/bar/baz">BarBaz</a></ul><span>TEXT <b>I</b> WANT</span>
<script type='text/javascript'>
/* <![CDATA[ */
var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};
/* ]]> */
</script>`

    domDocTest := html.NewTokenizer(strings.NewReader(s))
    for tokenType := domDocTest.Next(); tokenType != html.ErrorToken; {
        if tokenType != html.TextToken {
            tokenType = domDocTest.Next()
            continue
        }
        TxtContent := strings.TrimSpace(html.UnescapeString(string(domDocTest.Text())))
        if len(TxtContent) > 0 {
            fmt.Printf("%s\n", TxtContent)
        }
        tokenType = domDocTest.Next()
    }

But I get this result:

Links:
Foo
BarBaz
TEXT
I
WANT
/* <![CDATA[ */
var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};
/* ]]> */

I don't want the CDATA content. Any ideas on how to get only the text content?

2 Answers:

Answer 0: (score: 2)

If you use github.com/PuerkitoBio/goquery, it is easy to achieve what you want.

  • First, use document.Find() to select the elements you want to remove, such as the script elements.

  • Then remove them from the document with element.Remove().

  • Finally, print or get the text with document.Text().

So the final code is:

    document.Find("script").Remove()
    fmt.Println(document.Text())

Answer 1: (score: 2)

As @Eric Pauley suggested, I look at the TextTokens and StartTagTokens. Here is my solution:

    s := `
<p>Links:</p><ul><li><a href="foo">Foo</a><li>
<a href="/bar/baz">BarBaz</a></ul><span>TEXT <b>I</b> WANT</span>
<script type='text/javascript'>
/* <![CDATA[ */
var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};
/* ]]> */
</script>`

    domDocTest := html.NewTokenizer(strings.NewReader(s))
    previousStartTokenTest := domDocTest.Token()
loopDomTest:
    for {
        tt := domDocTest.Next()
        switch {
        case tt == html.ErrorToken:
            break loopDomTest // End of the document,  done
        case tt == html.StartTagToken:
            previousStartTokenTest = domDocTest.Token()
        case tt == html.TextToken:
            if previousStartTokenTest.Data == "script" {
                continue
            }
            TxtContent := strings.TrimSpace(html.UnescapeString(string(domDocTest.Text())))
            if len(TxtContent) > 0 {
                fmt.Printf("%s\n", TxtContent)
            }
        }
    }