How to traverse FSharp.Data HtmlDocument to extract contents as a string?

Time: 2020-03-20 16:28:15

Tags: f# f#-data

I would like to write a function that takes this HTML:

<div>
  <h1>Some header.</h1>
  <ul>
    <li>
      <p>Hello world!</p>
    </li>
    <li>
      <p>What is going on? <a href="http://example.com">This is a link</a>.</p>
    </li>
  </ul>
</div>

and returns this string:

Some header. Hello world! What is going on? This is a link.

In other words, I would like to pass this test:

let testInput: string = """
<div>
  <h1>Some header.</h1>
  <ul>
    <li>
      <p>Hello world!</p>
    </li>
    <li>
      <p>What is going on? <a href="http://example.com">This is a link</a>.</p>
    </li>
  </ul>
</div>
"""

let getContentsFromHtmlDocument (doc: HtmlDocument) =
  let getInner (node: HtmlNode): string =
    // How can I traverse this tree?
    ""
  let result =
    doc.Descendants ["h1"; "p"; "a"]
    |> Seq.map getInner
    |> List.ofSeq
    |> List.fold (+) ""
  result

[<Test>]
let Test1 () =
    let htmlDoc: HtmlDocument = HtmlDocument.Parse(testInput)
    let res = getContentsFromHtmlDocument htmlDoc
    Assert.AreEqual("Some header. Hello world! What is going on? This is a link.", res)

But I am having trouble figuring out how to traverse the tree. Any help would be much appreciated. Thanks!

1 Answer:

Answer 0 (score: 1)

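HtmlNodeExtensions contains extension methods that provide the usual operations for traversing the tree. For your specific use case, there is HtmlNodeExtensions.DirectInnerText(n).

To pass your test, though, you also need to separate the inner texts with spaces, which String.Join does more efficiently than folding with (+).

The version below still has a problem: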

open System
open FSharp.Data

let getContentsFromHtmlDocument (doc: HtmlDocument) =
    let getInner (node: HtmlNode): string =
        node.DirectInnerText()

    let result =
        doc.Descendants ["h1"; "p"; "a"]
        |> Seq.map getInner
        |> fun all -> String.Join(" ", all)

    result

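It will join

<p>What is going on? <a href="http://example.com">This is a link</a>.</p>

as "What is going on? . This is a link" as opposed to "What is going on? This is a link.", because DirectInnerText returns only the text directly inside each element, so the text of the nested <a> is emitted separately, after its parent's. That cannot be handled with your current structure.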
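
One possible way around the ordering problem is sketched below. It is not part of the original answer. It assumes that selecting only the block-level elements (h1 and p) and calling the InnerText() extension method on each is acceptable for your documents, since InnerText() also collects the text of nested elements such as <a> in document order. The name getContentsByBlock and the Trim() call are illustrative choices, and the script assumes FSharp.Data is referenced from NuGet.

```fsharp
#r "nuget: FSharp.Data"

open System
open FSharp.Data

// Hypothetical variant of getContentsFromHtmlDocument (the name is illustrative).
// Only h1 and p are selected; InnerText() then gathers the text of each element
// together with the text of its nested children (such as <a>) in document order.
let getContentsByBlock (doc: HtmlDocument) : string =
    doc.Descendants ["h1"; "p"]
    |> Seq.map (fun (node: HtmlNode) -> node.InnerText().Trim())
    |> fun parts -> String.Join(" ", parts)

let sample = """
<div>
  <h1>Some header.</h1>
  <ul>
    <li>
      <p>Hello world!</p>
    </li>
    <li>
      <p>What is going on? <a href="http://example.com">This is a link</a>.</p>
    </li>
  </ul>
</div>
"""

let text = getContentsByBlock (HtmlDocument.Parse sample)
printfn "%s" text
```

With the test input from the question, this should keep the link text inside its sentence. Note that if the selected elements can be nested inside one another, their text would be collected more than once, so the list of names would need adjusting for other documents.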