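I am scraping a website and receive an HTML page.
The page contains tables whose rows map
(actor -> role)
For example:
(actor = Jason Priestley -> role = Brandon Walsh)
Sometimes the "actor" or the "role" cell is missing from a row
(a row with 1 column where 2 are expected).
Sample document: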
<div id="90210">
<h2 style="margin:0 0 2px 0">beverly hills 90210</h2>
<table class="actors">
<tr><td class="actor">Jennie Garth</td><td class="role">Kelly Taylor</td></tr>
<tr><td class="actor">Shannen Doherty</td></tr>
<tr><td class="actor">Jason Priestley</td><td class="role">Brandon Walsh</td></tr>
</table>
</div>
I cannot find a clean way to filter out the rows that contain only 1 column.
My code:
def beverlyHillsParser(page: xml.NodeSeq): Map[String, String] = {
  // Attributes are selected with an "@" prefix; a bare "id" would look for a child element.
  val beverlyHillsData = page \\ "div" find ((node: xml.Node) => (node \ "@id").text == "90210")
  beverlyHillsData match {
    case Some(data) => {
      // String-based row filter: this is the line the question is about.
      val goodRows = data \\ "tr" filter (_.toString() contains "actor" ) filter (_.toString() contains "role" )
      val actors = goodRows \\ "td" filter ((node: xml.Node) => (node \ "@class").text == "actor") map { _.text }
      val roles = goodRows \\ "td" filter ((node: xml.Node) => (node \ "@class").text == "role") map { _.text }
      (actors zip roles).toMap
    }
    case None => Map()
  }
}
My main concern is this line:
val goodRows = data \\ "tr" filter (_.toString() contains "actor" ) filter (_.toString() contains "role" )
How can I filter out the bad rows more precisely (without _.toString())?
Any suggestions?
Answer 0 (score: 1)
You could write:
def actorWithRole(n: Node) = n \\ "@class" xml_sameElements(List("actor", "role"))
val goodRows = data \\ "tr" filter actorWithRole
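As an illustration of the same idea without `xml_sameElements`, the sketch below compares the class attributes of a row's direct `td` children against the expected list. The object name `RowFilterSketch`, the helper `hasActorAndRole`, and the inline sample are assumptions made for this example, not part of the original code, and it assumes the scala-xml library is on the classpath.

```scala
import scala.xml.{Node, XML}

object RowFilterSketch {
  // Hypothetical helper: true when the row's direct <td> children carry
  // exactly the classes "actor" and "role", in that order.
  def hasActorAndRole(row: Node): Boolean =
    (row \ "td").map(td => (td \ "@class").text).toList == List("actor", "role")

  def main(args: Array[String]): Unit = {
    // Two rows taken from the sample in the question: one complete, one missing its role.
    val table = XML.loadString(
      """<table class="actors">
        |<tr><td class="actor">Jennie Garth</td><td class="role">Kelly Taylor</td></tr>
        |<tr><td class="actor">Shannen Doherty</td></tr>
        |</table>""".stripMargin)
    val goodRows = (table \\ "tr").filter(hasActorAndRole)
    goodRows.foreach(row => println(row.text))
  }
}
```

Because the check looks at attributes rather than at the serialized row, a cell whose text happens to contain the word "role" or "actor" does not affect the result.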
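I would also change the data extraction so that each actor/role pair stays together; zipping two separately filtered lists can misalign them. I need more time to work out a clean solution.
My suggestion is: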
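import scala.xml.Node

def beverlyHillsParser(page: xml.NodeSeq): Map[String, String] = {
  // A row is good when its cells carry exactly the classes "actor" and "role", in that order.
  def actorWithRole(n: Node) = n \\ "@class" xml_sameElements(List("actor", "role"))
  // Assumed helper (not defined in the original answer): matches a node by its id attribute.
  def whereId(id: String)(n: Node) = (n \ "@id").text == id
  // Only applied to rows that passed actorWithRole, so exactly two cells are expected.
  def rowToEntry(r: Node) =
    r \ "td" map (_.text) match {
      case Seq(actor, role) => (actor -> role)
    }
  val beverlyHillsData = page \\ "div" find whereId("90210")
  beverlyHillsData match {
    case Some(data) => {
      val goodRows = data \\ "tr" filter actorWithRole
      val entries = goodRows map rowToEntry
      entries.toMap
    }
    case None => Map()
  }
}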
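One possible variation, sketched here under stated assumptions, folds the filter and the extraction into a single pass: each row is turned into an `Option` of an actor/role pair, and rows missing either cell simply yield nothing. The names `CastParserSketch`, `cell`, and `parseCast` are hypothetical, and the sample document is the one from the question; this is not the answer's own method.

```scala
import scala.xml.{Node, XML}

object CastParserSketch {
  // Hypothetical helper: text of the direct <td> child with the given class, if present.
  def cell(row: Node, cls: String): Option[String] =
    (row \ "td").find(td => (td \ "@class").text == cls).map(_.text)

  // Builds actor -> role entries, skipping rows that lack either cell.
  def parseCast(page: Node): Map[String, String] =
    (page \\ "div").find(div => (div \ "@id").text == "90210") match {
      case Some(data) =>
        val entries = for {
          row   <- (data \\ "tr").toList
          actor <- cell(row, "actor")
          role  <- cell(row, "role")
        } yield actor -> role
        entries.toMap
      case None => Map.empty
    }

  def main(args: Array[String]): Unit = {
    val page = XML.loadString(
      """<div id="90210">
        |<h2 style="margin:0 0 2px 0">beverly hills 90210</h2>
        |<table class="actors">
        |<tr><td class="actor">Jennie Garth</td><td class="role">Kelly Taylor</td></tr>
        |<tr><td class="actor">Shannen Doherty</td></tr>
        |<tr><td class="actor">Jason Priestley</td><td class="role">Brandon Walsh</td></tr>
        |</table>
        |</div>""".stripMargin)
    parseCast(page).foreach { case (actor, role) => println(s"$actor -> $role") }
  }
}
```

Since each pair is built from a single row, the actor and role cannot drift out of alignment, and no separate row filter is needed.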