Parsing an HTML page with Scala XML support - filtering out data

Time: 2013-11-06 12:57:02

Tags: xml scala xml-parsing pattern-matching web-scraping

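I am scraping a website and receive an HTML page.

The page contains some tables with rows of the form

(actor -> role)

For example:

(actor = Jason Priestley -> role = Brandon Walsh)

Sometimes the "actor" or the "role" is missing from a row

(a row with 1 column where 2 are expected)

Sample document: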

<div id="90210">
      <h2 style="margin:0 0 2px 0">beverly hills 90210</h2>
      <table class="actors">
        <tr><td class="actor">Jennie Garth</td><td class="role">Kelly Taylor</td></tr>
        <tr><td class="actor">Shannen Doherty</td></tr>
        <tr><td class="actor">Jason Priestley</td><td class="role">Brandon Walsh</td></tr>
      </table>
</div>

I can't manage to filter out the rows that contain only 1 column:

My code:

  def beverlyHillsParser(page: xml.NodeSeq) : Map[String, String] = {
    // Attributes are selected with an "@" prefix ("@id", "@class"); a bare name selects child elements.
    val beverlyHillsData = page \\ "div" find ((node: xml.Node) => (node \ "@id").text == "90210")
    beverlyHillsData match {
      case Some(data) => {
        val goodRows = data \\ "tr" filter (_.toString() contains "actor" ) filter (_.toString() contains "role" )
        val actors = goodRows \\ "td" filter ((node: xml.Node) => (node \ "@class").text == "actor") map { _.text }
        val roles  = goodRows \\ "td" filter ((node: xml.Node) => (node \ "@class").text == "role")  map {_.text}
        (actors zip roles).toMap
      }
      case None => Map()
    }
  }

The main concern is this line:

val goodRows = data \\ "tr" filter (_.toString() contains "actor" ) filter (_.toString() contains "role" )

How can I filter out the bad rows more precisely (without _.toString())?

Any suggestions?

1 Answer:

Answer 0: (score: 1)

You could use

def actorWithRole(n: xml.Node) = n \\ "@class" xml_sameElements(List("actor", "role"))

val goodRows = data \\ "tr" filter actorWithRole
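
As a quick check, the sketch below applies this predicate to the three rows from the question. It assumes Scala 2 with the scala-xml library on the classpath; the object name `ActorWithRoleDemo` and the `table` value are made up for illustration.

```scala
import scala.xml.{Elem, Node, NodeSeq}

object ActorWithRoleDemo {
  // The table from the question, written as an XML literal.
  val table: Elem =
    <table class="actors">
      <tr><td class="actor">Jennie Garth</td><td class="role">Kelly Taylor</td></tr>
      <tr><td class="actor">Shannen Doherty</td></tr>
      <tr><td class="actor">Jason Priestley</td><td class="role">Brandon Walsh</td></tr>
    </table>

  // True only when the class attributes found inside the row are
  // exactly "actor" followed by "role".
  def actorWithRole(n: Node): Boolean =
    (n \\ "@class") xml_sameElements List("actor", "role")

  // Keeps the Jennie Garth and Jason Priestley rows; the one-cell row is dropped.
  val goodRows: NodeSeq = (table \ "tr").filter(actorWithRole)
}
```

Because the comparison is made on attribute values rather than on the serialized row, a cell whose text merely contains the word "actor" or "role" does not affect the result.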

I would also change the data extraction so that each actor/role pair stays together. I need some more time to work out a clean solution.
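
One likely reason for keeping the pairs together: when the actor and role columns are collected separately and zipped, any incomplete row that slips past the filter shifts every later pair. The sketch below illustrates this on the unfiltered sample table; it assumes scala-xml is available, and `ZipMisalignmentDemo` and `cells` are hypothetical names.

```scala
import scala.xml.Elem

object ZipMisalignmentDemo {
  // The table from the question, including the row that has no role cell.
  val table: Elem =
    <table class="actors">
      <tr><td class="actor">Jennie Garth</td><td class="role">Kelly Taylor</td></tr>
      <tr><td class="actor">Shannen Doherty</td></tr>
      <tr><td class="actor">Jason Priestley</td><td class="role">Brandon Walsh</td></tr>
    </table>

  // Hypothetical helper: text of every <td> whose class attribute equals cls.
  def cells(cls: String): List[String] =
    (table \\ "td").toList.filter(td => (td \ "@class").text == cls).map(_.text)

  // The two columns are collected independently and only then paired up.
  val zipped: Map[String, String] = cells("actor").zip(cells("role")).toMap
}
```

Here `zipped` pairs Shannen Doherty with Brandon Walsh and loses Jason Priestley, because the role column is one entry shorter than the actor column.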

My suggestion is

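def beverlyHillsParser(page: xml.NodeSeq) : Map[String, String] = {

  // whereId is not defined in the original answer; assumed here to test the id attribute.
  def whereId(id: String): xml.Node => Boolean = n => (n \ "@id").text == id

  // A row qualifies only if its class attributes are exactly "actor" followed by "role".
  def actorWithRole(n: xml.Node) = n \\ "@class" xml_sameElements(List("actor", "role"))

  // Read both cells from the same row so each actor stays paired with its role.
  def rowToEntry(r: xml.Node) =
    r \ "td" map (_.text) match {
      case Seq(actor, role) => (actor -> role)
    }

  val beverlyHillsData = page \\ "div" find whereId("90210")

  beverlyHillsData match {
    case Some(data) => {
      val goodRows = data \\ "tr" filter actorWithRole
      val entries = goodRows map rowToEntry
      entries.toMap
    }
    case None => Map()
  }
}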
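
To try the suggestion end to end, here is a self-contained sketch that runs it against the sample document from the question. It assumes Scala 2 with scala-xml on the classpath. `whereId` is not defined in the answer, so the version here (an id-attribute predicate) is an assumption, and `BeverlyHillsDemo` is a made-up object name. The rows are converted to a `List` and passed through `collect`, a small deviation from the code above, so that a row without exactly two cells is skipped instead of raising a `MatchError`.

```scala
import scala.xml.{Elem, Node, NodeSeq}

object BeverlyHillsDemo {
  // Sample document from the question.
  val page: Elem =
    <div id="90210">
      <h2 style="margin:0 0 2px 0">beverly hills 90210</h2>
      <table class="actors">
        <tr><td class="actor">Jennie Garth</td><td class="role">Kelly Taylor</td></tr>
        <tr><td class="actor">Shannen Doherty</td></tr>
        <tr><td class="actor">Jason Priestley</td><td class="role">Brandon Walsh</td></tr>
      </table>
    </div>

  // Assumed helper: matches nodes whose id attribute equals the given value.
  def whereId(id: String): Node => Boolean = n => (n \ "@id").text == id

  // Same predicate as in the answer.
  def actorWithRole(n: Node): Boolean =
    (n \\ "@class") xml_sameElements List("actor", "role")

  def beverlyHillsParser(page: NodeSeq): Map[String, String] =
    (page \\ "div").find(whereId("90210")) match {
      case Some(data) =>
        (data \\ "tr").filter(actorWithRole).toList
          .map(row => (row \ "td").toList.map(_.text))
          // Each remaining row yields one (actor, role) pair.
          .collect { case List(actor, role) => actor -> role }
          .toMap
      case None => Map.empty
    }

  def main(args: Array[String]): Unit =
    println(beverlyHillsParser(page))
}
```

For the sample document this yields the two complete pairs (Jennie Garth and Jason Priestley with their roles); a page without a matching `div` yields an empty map.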