Extracting text content from HTML in Golang

Time: 2014-01-08 15:48:36

Tags: regex string go byte substring

What is the best way to extract inner substrings from a string in Golang?

Input

"Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"

Output

"this is paragraph \n
 this is paragraph 2"

Does Go have a string package or library for something like this?

package main

import (
    "fmt"
    "strings"
)

func main() {
    longString := "Hello world <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"

    newString := getInnerStrings("<p>", "</p>", longString)

    fmt.Println(newString)
    // output: this is paragraph \n
    //         this is paragraph 2
}

func getInnerStrings(start, end, str string) string {
    // Brain freeze
    // Regex?
    // Bytes loop?
}

Thanks

3 answers:

Answer 0: (score: 5)

Don't use regular expressions to try to parse HTML. Use a fully capable HTML tokenizer and parser.

I suggest reading this article on CodingHorror.
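
The links in this answer did not survive, so the exact parser it recommends is unknown. As a rough sketch of the tokenizer approach using only the standard library, the following uses encoding/xml in non-strict mode (the setup its documentation describes for HTML-like input). This is a swapped-in stand-in, not the answerer's method, and paragraphTexts is a hypothetical helper name. For real-world HTML, a dedicated parser such as golang.org/x/net/html is likely the safer choice.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// paragraphTexts is a hypothetical helper: it walks the token stream and
// returns the trimmed text found inside each <p> element.
func paragraphTexts(input string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(input))
	// Non-strict settings so the decoder tolerates HTML-style markup.
	dec.Strict = false
	dec.AutoClose = xml.HTMLAutoClose
	dec.Entity = xml.HTMLEntity

	var out []string
	var buf strings.Builder
	depth := 0 // how many <p> elements are currently open

	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if strings.EqualFold(t.Name.Local, "p") {
				depth++
			}
		case xml.EndElement:
			if strings.EqualFold(t.Name.Local, "p") && depth > 0 {
				depth--
				if depth == 0 {
					out = append(out, strings.TrimSpace(buf.String()))
					buf.Reset()
				}
			}
		case xml.CharData:
			// Text outside any <p> is the "junk" and is skipped.
			if depth > 0 {
				buf.Write(t)
			}
		}
	}
}

func main() {
	input := "Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"
	texts, err := paragraphTexts(input)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(strings.Join(texts, "\n"))
}
```

Because the tokenizer tracks element boundaries itself, tags with attributes (for example <p class="x">) are handled without any change to the matching logic, which is the main advantage over pattern matching on raw text.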

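Answer 1: (score: 0)

StrExtract retrieves the string between two delimiters.

StrExtract(sExper, sAdelim, sCdelim, nOccur)

sExper: the expression to search.

sAdelim: the delimiter that marks the start of the text to extract.

sCdelim: the delimiter that marks the end of the text to extract.

nOccur: which occurrence of sAdelim in sExper the extraction starts from.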

Go Play

package main

import (
    "fmt"
    "strings"
)

func main() {
    s := "a11ba22ba333ba4444ba55555ba666666b"
    fmt.Println("StrExtract1: ", StrExtract(s, "a", "b", 5))
}

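// StrExtract returns the text between the nOccur-th occurrence of sAdelim
// and the next sCdelim, or "" if either delimiter is not found.
func StrExtract(sExper, sAdelim, sCdelim string, nOccur int) string {
    // Occurrences are counted from 1 (a negative value would otherwise
    // panic on the slice index below).
    if nOccur < 1 {
        return ""
    }

    // Element n of the split holds the text following the n-th sAdelim.
    aExper := strings.Split(sExper, sAdelim)
    if len(aExper) <= nOccur {
        return ""
    }

    // Cut that text at the first sCdelim; a single element means it is absent.
    sMember := aExper[nOccur]
    aExper = strings.Split(sMember, sCdelim)
    if len(aExper) == 1 {
        return ""
    }

    return aExper[0]
}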
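
For the input in the question, the same delimiter idea can be turned into a loop that collects every match rather than a single occurrence. The sketch below fills in the question's getInnerStrings with strings.Index. It is only one possible implementation and makes some assumptions: matches are trimmed and joined with newlines, and the delimiters are matched as plain text, so it will not cope with attributes, nesting, or differing case. The caution in answer 0 still applies to real HTML.

```go
package main

import (
	"fmt"
	"strings"
)

// getInnerStrings returns every substring found between start and end,
// trimmed of surrounding whitespace and joined with newlines.
func getInnerStrings(start, end, str string) string {
	// Empty delimiters would match everywhere and never advance.
	if start == "" || end == "" {
		return ""
	}

	var parts []string
	for {
		i := strings.Index(str, start)
		if i < 0 {
			break
		}
		// Skip past the opening delimiter.
		str = str[i+len(start):]

		j := strings.Index(str, end)
		if j < 0 {
			break // unterminated match: ignore it
		}
		parts = append(parts, strings.TrimSpace(str[:j]))

		// Continue searching after the closing delimiter.
		str = str[j+len(end):]
	}
	return strings.Join(parts, "\n")
}

func main() {
	longString := "Hello world <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"
	fmt.Println(getInnerStrings("<p>", "</p>", longString))
	// Prints:
	// this is paragraph
	// this is paragraph 2
}
```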

Answer 2: (score: 0)

Here is my function, which I have been using.

You can try it on the playground at https://play.golang.org/p/Xo0SJu0Vq4.