Similar to "Apache Spark: dealing with Option/Some/None in RDDs", I have a function that is applied via df.mapPartitions:
def mapToTopics(iterator: Iterator[RawRecords]): Iterator[TopicContent] = {
iterator.map(k => {
browser.parseString(k.content) >> elementList("doc").map(d => {
TopicContent((d >> text("docno")).head, (d >> text("text")).head, k.path)
})
})
}
The following are also defined:
@transient lazy val browser = JsoupBrowser()
case class TopicContent(topic: String, content: String, filepath: String)
case class RawRecords(path: String, content: String)
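The code above throws a NoSuchElementException when a document has no XML tag with text, which happens for some malformed documents.
How can this code be corrected and simplified so that it handles Options properly?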
When I try util.Try as shown in the question linked above and apply flatMap, my code fails: instead of working on Element, it works on Char.
try {
Some(TopicContent((d >> text("docno")).head, (d >> text("text")).head, k.path))
} catch {
case noelem: NoSuchElementException => {
println(d.head)
None
}
}
})
val flattended = results.flatten
Unfortunately, this only returns Option[Nothing].
A minimal sample without Spark is at https://gist.github.com/geoHeil/bfb01427b88cf58ea755f912ce539712 (the full code is also below):
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.scraper.ContentExtractors.elementList
@transient lazy val browser = JsoupBrowser()
val broken =
"""
|<docno>
| LA051089-0001
| </docno>
| <docid>
| 54901
| </docid>
| <date>
| <p> May 10, 1989, Wednesday, Home Edition </p>
| </date>
| <section>
| <p> Metro; Part 2; Page 2; Column 2 </p>
| </section>
| <graphic>
| <p> Photo, Cloudy and Clear A stormy afternoon provides a clear view of Los Angeles' skyline, with the still-emerging Library Tower rising above its companion buildings. KEN LUBAS / Los Angeles Times </p>
| </graphic>
| <type>
| <p> Wild Art </p>
| </type>
""".stripMargin
val correct =
"""
|<DOC>
|<DOCNO> FR940104-0-00001 </DOCNO>
|<PARENT> FR940104-0-00001 </PARENT>
|<TEXT>
|
|<!-- PJG FTAG 4700 -->
|
|<!-- PJG STAG 4700 -->
|
|<!-- PJG ITAG l=90 g=1 f=1 -->
|
|<!-- PJG /ITAG -->
|
|<!-- PJG ITAG l=90 g=1 f=4 -->
|Federal Register
|<!-- PJG /ITAG -->
|
|<!-- PJG ITAG l=90 g=1 f=1 -->
|␣/␣Vol. 59, No. 2␣/␣Tuesday, January 4, 1994␣/␣Rules and Regulations
|
|<!-- PJG 0012 frnewline -->
|
|<!-- PJG /ITAG -->
|
|<!-- PJG ITAG l=01 g=1 f=1 -->
|Vol. 59, No. 2
|<!-- PJG 0012 frnewline -->
|
|<!-- PJG /ITAG -->
|
|<!-- PJG ITAG l=02 g=1 f=1 -->
|Tuesday, January 4, 1994
|<!-- PJG 0012 frnewline -->
|
|<!-- PJG 0012 frnewline -->
|
|<!-- PJG /ITAG -->
|
|<!-- PJG /STAG -->
|
|<!-- PJG /FTAG -->
|</TEXT>
|</DOC>
""".stripMargin
case class RawRecords(path: String, content: String)
case class TopicContent(topic: String, content: String, filepath: String)
val raw = Seq(RawRecords("first", correct), RawRecords("second", broken))
val result = mapToTopics(raw.iterator)
// Variant 1
def mapToTopics(iterator: Iterator[RawRecords]): Iterator[TopicContent] = {
iterator.flatMap(k => {
val documents = browser.parseString(k.content) >> elementList("doc")
documents.map(d => {
val docno = d >> text("docno")
// try {
val textContent = d >> text("text")
TopicContent(docno, textContent, k.path)
// } catch {
// case _:NoSuchElementException => TopicContent(docno, None, k.path)
// }
}) //.filter(_.content !=None)
})
}
// When broken down even further you see the following will produce Options of strings
browser.parseString(raw(0).content) >> elementList("doc").map(d => {
val docno = d >> text("docno")
val textContent = d >> text("text")
(docno.headOption, textContent.headOption)
})
// while below will now map to characters. What is wrong here?
val documents = browser.parseString(raw(0).content) >> elementList("doc")
documents.map(d => {
val docno = d >> text("docno")
val textContent = d >> text("text")
(docno.headOption, textContent.headOption)
})
Answer 0 (score: 1)
I am not familiar with the API you are using, but using headOption in a for comprehension may help you:
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
import net.ruippeixotog.scalascraper.dsl.DSL.Parse._
iterator.map(k => {
browser.parseString(k.content) >> elementList("doc").flatMap(d => {
for {
docno <- (d >> text("docno")).headOption
text <- (d >> text("text")).headOption
} yield TopicContent(docno, text, k.path)
})
})
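This way you only build a TopicContent (really a Some(TopicContent)) when both docno and text are present, and None otherwise. The flatMap then removes all the None values and extracts the content of the Some values, leaving you with a collection of TopicContent instances created for all valid XML.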
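The pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the scraper code itself: ParsedDoc and OptionDemo are hypothetical stand-ins for an already parsed document whose fields may be missing, and "some/path" is a made-up file path.

```scala
// Hypothetical stand-in for a parsed <doc>: each field may be absent.
final case class ParsedDoc(docno: Option[String], text: Option[String])
final case class TopicContent(topic: String, content: String, filepath: String)

object OptionDemo {
  // Yields Some(TopicContent) only when both fields are present, None otherwise.
  def toTopic(doc: ParsedDoc, path: String): Option[TopicContent] =
    for {
      docno   <- doc.docno
      content <- doc.text
    } yield TopicContent(docno, content, path)

  val docs: List[ParsedDoc] = List(
    ParsedDoc(Some("FR940104-0-00001"), Some("Federal Register")),
    ParsedDoc(Some("LA051089-0001"), None) // malformed: no text field
  )

  // flatMap drops every None and unwraps every Some.
  val topics: List[TopicContent] = docs.flatMap(d => toTopic(d, "some/path"))
}
```

Here OptionDemo.topics contains a single TopicContent, because the second document has no text and is silently skipped instead of throwing.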
Answer 1 (score: 1)
The difference between the two examples is operator precedence. When you write browser.parseString(raw(0).content) >> elementList("doc").map(...), you are calling map on elementList("doc"), not on the whole expression. To make the first example behave like the second, you need to write (browser.parseString(raw(0).content) >> elementList("doc")).map(...) (recommended) or browser.parseString(raw(0).content) >> elementList("doc") map(...).
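The same precedence rule can be observed with plain lists, independent of any scraping library. The sketch below uses ++ in place of >> purely for illustration; PrecedenceDemo and its values are made-up names.

```scala
object PrecedenceDemo {
  val xs = List(1)
  val ys = List(2, 3)

  // A dotted method call binds tighter than an infix operator,
  // so map is applied to ys only and ++ runs afterwards.
  val dotOnRight: List[Int] = xs ++ ys.map(_ * 10)

  // Parentheses force ++ to run first, so map sees the combined list.
  val parenthesized: List[Int] = (xs ++ ys).map(_ * 10)

  // Written infix, map is an alphanumeric operator with lower precedence
  // than ++, so this also maps over the combined list.
  val infixMap: List[Int] = xs ++ ys map (_ * 10)
}
```

dotOnRight is List(1, 20, 30), while parenthesized and infixMap are both List(10, 20, 30).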
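In the context of scala-scraper, the library you are using, these two expressions mean very different things. With browser.parseString(raw(0).content) >> elementList("doc") you extract a List[Element] from the document, and calling map on it does what you would expect for a collection. On the other hand, elementList("doc") is an HtmlExtractor[List[Element]], and calling map on an extractor creates a new HtmlExtractor whose result is the original extractor's result transformed. That is why you end up with two very different results.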
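To make the distinction concrete, here is a heavily simplified model of an extractor. Doc, Extractor and words are hypothetical types and values written for this sketch, not scala-scraper's real classes; the only assumption carried over from the library is that map on an extractor transforms the whole extracted result.

```scala
// Hypothetical, simplified model; not scala-scraper's actual types.
final case class Doc(content: String) {
  // Running an extractor against a document yields the extracted value.
  def >>[A](e: Extractor[A]): A = e.run(this)
}

final case class Extractor[A](run: Doc => A) {
  // map builds a new extractor; f receives the WHOLE extracted result.
  def map[B](f: A => B): Extractor[B] = Extractor(doc => f(run(doc)))
}

object ExtractorDemo {
  val words: Extractor[List[String]] = Extractor(_.content.split(" ").toList)
  val doc = Doc("alpha beta")

  // map on the extractor: the function sees the List[String],
  // so headOption returns the first string.
  val viaExtractor: Option[String] = doc >> words.map(_.headOption)

  // map on the extracted list: the function sees each String,
  // so headOption returns that string's first character.
  val viaList: List[Option[Char]] = (doc >> words).map(_.headOption)
}
```

viaExtractor is Some("alpha"), whereas viaList is List(Some('a'), Some('b')). In this model the second form is what produces characters, which matches the symptom described in the question.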