To extract text from HTML, I use a fully HTML5-compliant tokenizer and parser, like this:
s := `
<p>Links:</p><ul><li><a href="foo">Foo</a><li>
<a href="/bar/baz">BarBaz</a></ul><span>TEXT <b>I</b> WANT</span>
<script type='text/javascript'>
/* <![CDATA[ */
var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};
/* ]]> */
</script>`
domDocTest := html.NewTokenizer(strings.NewReader(s))
// Advance one token per iteration until the tokenizer reports an error (normally io.EOF).
for tokenType := domDocTest.Next(); tokenType != html.ErrorToken; tokenType = domDocTest.Next() {
	// Only text tokens are of interest; skip tags, comments, and doctypes.
	if tokenType != html.TextToken {
		continue
	}
	TxtContent := strings.TrimSpace(html.UnescapeString(string(domDocTest.Text())))
	if len(TxtContent) > 0 {
		fmt.Printf("%s\n", TxtContent)
	}
}
But I get this result:
Links:
Foo
BarBaz
TEXT
I
WANT
/* <![CDATA[ */
var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};
/* ]]> */
I don't want the CDATA content. Any ideas on how to get only the text content?
Answer 0 (score: 2)
If you use github.com/PuerkitoBio/goquery, it is easy to get what you want.
First, use document.Find() to identify the elements you want to remove, for example the script elements.
Then remove them with element.Remove().
Finally, use document.Text() to get the remaining text.
So the final code is:
document.Find("script").Remove()
fmt.Println(document.Text())
Answer 1 (score: 2)
As @Eric Pauley suggested, I look at both TextTokens and StartTagTokens.
Here is my solution:
s := `
<p>Links:</p><ul><li><a href="foo">Foo</a><li>
<a href="/bar/baz">BarBaz</a></ul><span>TEXT <b>I</b> WANT</span>
<script type='text/javascript'>
/* <![CDATA[ */
var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};
/* ]]> */
</script>`
domDocTest := html.NewTokenizer(strings.NewReader(s))
// Holds the most recent start tag, so text inside <script> can be recognized.
previousStartTokenTest := domDocTest.Token()
loopDomTest:
for {
	tt := domDocTest.Next()
	switch tt {
	case html.ErrorToken:
		break loopDomTest // End of the document, done
	case html.StartTagToken:
		previousStartTokenTest = domDocTest.Token()
	case html.TextToken:
		// Skip text whose enclosing start tag is <script>.
		if previousStartTokenTest.Data == "script" {
			continue
		}
		TxtContent := strings.TrimSpace(html.UnescapeString(string(domDocTest.Text())))
		if len(TxtContent) > 0 {
			fmt.Printf("%s\n", TxtContent)
		}
	}
}