I want to write a function that extracts, from this HTML:
<div>
<h1>Some header.</h1>
<ul>
<li>
<p>Hello world!</p>
</li>
<li>
<p>What is going on? <a href="http://example.com">This is a link</a>.</p>
</li>
</ul>
</div>
this string:
Some header. Hello world! What is going on? This is a link.
In other words, I want this test to pass:
let testInput: string = """
<div>
<h1>Some header.</h1>
<ul>
<li>
<p>Hello world!</p>
</li>
<li>
<p>What is going on? <a href="http://example.com">This is a link</a>.</p>
</li>
</ul>
</div>
"""
let getContentsFromHtmlDocument (doc: HtmlDocument) =
    let getInner (node: HtmlNode): string =
        // How can I traverse this tree?
        ""
    let result =
        doc.Descendants ["h1"; "p"; "a"]
        |> Seq.map getInner
        |> List.ofSeq
        |> List.fold (+) ""
    result

[<Test>]
let Test1 () =
    let htmlDoc: HtmlDocument = HtmlDocument.Parse(testInput)
    let res = getContentsFromHtmlDocument htmlDoc
    Assert.AreEqual("Some header. Hello world! What is going on? This is a link.", res)
But I am having trouble working out how to traverse the tree. Any help would be appreciated. Thanks!
Answer 0 (score: 1)
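HtmlNodeExtensions contains the extension methods that are usually needed to traverse the tree. For your specific use case, there is HtmlNodeExtensions.DirectInnerText(n).
To pass the test, though, you need to separate the inner texts with spaces, which String.Join can do more efficiently than folding with (+).
There is still a problem: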
let getContentsFromHtmlDocument (doc: HtmlDocument) =
    let getInner (node: HtmlNode): string =
        node.DirectInnerText()
    let result =
        doc.Descendants ["h1"; "p"; "a"]
        |> Seq.map getInner
        |> fun all -> String.Join(" ", all)
    result
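This will turn:
<p>What is going on? <a href="http://example.com">This is a link</a>.</p>
into "What is going on? . This is a link"
rather than the expected "What is going on? This is a link.", because the <p> and the nested <a> are read as separate nodes. That cannot be handled with your current structure.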
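One possible way around this, as a minimal sketch rather than a definitive fix: select only the h1 and p elements and read each one with InnerText() instead of DirectInnerText(), so that the text of a nested <a> stays in place inside its parent. This assumes FSharp.Data's InnerText() extension concatenates the text of all descendant nodes in document order. The Trim() call is an extra precaution not taken from the code above.

```fsharp
open System
open FSharp.Data

// Sketch: "a" is dropped from the selector because InnerText() on the
// enclosing <p> is assumed to already include the link text in place.
let getContentsFromHtmlDocument (doc: HtmlDocument) =
    doc.Descendants ["h1"; "p"]
    // Read the full text of each selected element, nested elements included.
    |> Seq.map (fun (node: HtmlNode) -> node.InnerText().Trim())
    // Separate the pieces with a single space, as the test expects.
    |> fun parts -> String.Join(" ", parts)
```

The trade-off is that this relies on the selected elements not being nested inside one another. If a p could contain another selected element, its text would be collected more than once.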