Spark DataFrame: handling Option / Some / None

Date: 2017-04-03 14:31:36

Tags: scala apache-spark spark-dataframe optional

Similar to "Apache Spark: dealing with Option/Some/None in RDDs", I have a function that is applied via df.mapPartitions:
def mapToTopics(iterator: Iterator[RawRecords]): Iterator[TopicContent] = {
    iterator.map(k => {
      browser.parseString(k.content) >> elementList("doc").map(d => {
        TopicContent((d >> text("docno")).head, (d >> text("text")).head, k.path)
      })
    })
  }

The following are also defined:

@transient lazy val browser = JsoupBrowser()
case class TopicContent(topic: String, content: String, filepath: String)
case class RawRecords(path: String, content: String)

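If an XML tag with text is missing, the code above throws a NoSuchElementException (this happens for some malformed documents). How can this code be corrected and simplified so that it handles Options properly?

When I try util.Try as shown in the linked question and apply flatMap, my code fails: instead of an Element it operates on a Char.

Edit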

try {
              Some(TopicContent((d >> text("docno")).head, (d >> text("text")).head, k.path))
            } catch {
              case noelem: NoSuchElementException => {
                println(d.head)
                None
              }
            }
          })
val flattended = results.flatten

Unfortunately, this only returns Option[Nothing].

Edit 4

https://gist.github.com/geoHeil/bfb01427b88cf58ea755f912ce539712 contains a minimal sample without Spark (the complete code is also shown below):

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.scraper.ContentExtractors.elementList

@transient lazy val browser = JsoupBrowser()
val broken =
  """
    |<docno>
    |   LA051089-0001
    | </docno>
    | <docid>
    |   54901
    | </docid>
    | <date>
    |  <p> May 10, 1989, Wednesday, Home Edition </p>
    | </date>
    | <section>
    |  <p> Metro; Part 2; Page 2; Column 2 </p>
    | </section>
    | <graphic>
    |  <p> Photo, Cloudy and Clear A stormy afternoon provides a clear view of Los Angeles' skyline, with the still-emerging Library Tower rising above its companion buildings. KEN LUBAS / Los Angeles Times </p>
    | </graphic>
    | <type>
    |  <p> Wild Art </p>
    | </type>
  """.stripMargin
val correct =
  """
    |<DOC>
    |<DOCNO> FR940104-0-00001 </DOCNO>
    |<PARENT> FR940104-0-00001 </PARENT>
    |<TEXT>
    |
    |<!-- PJG FTAG 4700 -->
    |
    |<!-- PJG STAG 4700 -->
    |
    |<!-- PJG ITAG l=90 g=1 f=1 -->
    |
    |<!-- PJG /ITAG -->
    |
    |<!-- PJG ITAG l=90 g=1 f=4 -->
    |Federal Register
    |<!-- PJG /ITAG -->
    |
    |<!-- PJG ITAG l=90 g=1 f=1 -->
    |&blank;/&blank;Vol. 59, No. 2&blank;/&blank;Tuesday, January 4, 1994&blank;/&blank;Rules and Regulations
    |
    |<!-- PJG 0012 frnewline -->
    |
    |<!-- PJG /ITAG -->
    |
    |<!-- PJG ITAG l=01 g=1 f=1 -->
    |Vol. 59, No. 2
    |<!-- PJG 0012 frnewline -->
    |
    |<!-- PJG /ITAG -->
    |
    |<!-- PJG ITAG l=02 g=1 f=1 -->
    |Tuesday, January 4, 1994
    |<!-- PJG 0012 frnewline -->
    |
    |<!-- PJG 0012 frnewline -->
    |
    |<!-- PJG /ITAG -->
    |
    |<!-- PJG /STAG -->
    |
    |<!-- PJG /FTAG -->
    |</TEXT>
    |</DOC>
  """.stripMargin
case class RawRecords(path: String, content: String)

case class TopicContent(topic: String, content: String, filepath: String)
val raw = Seq(RawRecords("first", correct), RawRecords("second", broken))
val result = mapToTopics(raw.iterator)

// Variant 1
def mapToTopics(iterator: Iterator[RawRecords]): Iterator[TopicContent] = {
  iterator.flatMap(k => {
    val documents = browser.parseString(k.content) >> elementList("doc")
    documents.map(d => {
      val docno = d >> text("docno")
      //        try {
      val textContent = d >> text("text")
      TopicContent(docno, textContent, k.path)
      //        } catch {
      //          case _:NoSuchElementException => TopicContent(docno, None, k.path)
      //        }
    }) //.filter(_.content !=None)
  })
}


// When broken down even further you see the following will produce Options of strings
browser.parseString(raw(0).content) >> elementList("doc").map(d => {
  val docno = d >> text("docno")
  val textContent = d >> text("text")
  (docno.headOption, textContent.headOption)
})

// while below will now map to characters. What is wrong here?
val documents = browser.parseString(raw(0).content) >> elementList("doc")
  documents.map(d => {
  val docno = d >> text("docno")
  val textContent = d >> text("text")
  (docno.headOption, textContent.headOption)
})

2 answers:

Answer 0 (score: 1)

I am not familiar with the API you are using, but using headOption inside a for comprehension may help you:

import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
import net.ruippeixotog.scalascraper.dsl.DSL.Parse._

iterator.map(k => {
      browser.parseString(k.content) >> elementList("doc").flatMap(d => {
        for {
           docno <- (d >> text("docno")).headOption
           text <- (d >> text("text")).headOption
        } yield TopicContent(docno, text, k.path)
      })
})

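This way you only build a TopicContent when both docno and text are present, really building a Some(TopicContent), and None otherwise. The flatMap then removes all the Nones and extracts the content of the Somes, leaving you with a collection of TopicContent instances created for all valid XML documents.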
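The same pattern can be tried without scala-scraper. The following is only a minimal sketch under stated assumptions: a parsed document is modelled as a plain Map from tag name to text, and the names OptionSketch, toTopic and docs are hypothetical and not part of the question or of any library.

```scala
object OptionSketch {
  case class TopicContent(topic: String, content: String, filepath: String)

  // Assumption: a parsed document is modelled as a Map from tag name to text.
  // Yields Some(TopicContent) only when both entries exist, None otherwise.
  def toTopic(doc: Map[String, String], path: String): Option[TopicContent] =
    for {
      docno <- doc.get("docno")
      body  <- doc.get("text")
    } yield TopicContent(docno, body, path)

  val docs: List[Map[String, String]] = List(
    Map("docno" -> "FR940104-0-00001", "text" -> "Federal Register"),
    Map("docno" -> "LA051089-0001") // malformed: no "text" entry
  )

  // flatMap discards the None results and unwraps the Some results.
  val topics: List[TopicContent] = docs.flatMap(d => toTopic(d, "first"))
}
```

Here OptionSketch.topics contains a single TopicContent, built from the first map. The second map has no "text" entry, so it is dropped without any exception being thrown.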

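Answer 1 (score: 1)

The difference between the two examples is operator precedence. When you write browser.parseString(raw(0).content) >> elementList("doc").map(...), you are calling map on elementList("doc"), not on the whole expression. To make the first example behave like the second, you need to write (browser.parseString(raw(0).content) >> elementList("doc")).map(...) (recommended) or browser.parseString(raw(0).content) >> elementList("doc") map(...).

In the context of scala-scraper, the library you are using, the two expressions mean very different things. With browser.parseString(raw(0).content) >> elementList("doc") you extract a List[Element] from the document, and calling map on it does what you would expect for a collection. On the other hand, elementList("doc") is an HtmlExtractor[List[Element]], and calling map on an extractor creates a new HtmlExtractor that transforms the result of the original extractor. That is why you end up with two very different results.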
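The precedence effect can be reproduced with the standard library alone. The sketch below is not scala-scraper code: Extractor, Doc and words are hypothetical miniatures that only imitate the shape of an extractor with a map method and a document with a >> operator.

```scala
object PrecedenceSketch {
  // Hypothetical miniature of an extractor: a function from a document to a value.
  final case class Extractor[A](run: String => A) {
    // map on the extractor transforms the extractor's whole result.
    def map[B](f: A => B): Extractor[B] = Extractor(doc => f(run(doc)))
  }

  // Hypothetical document wrapper whose >> operator applies an extractor.
  final case class Doc(content: String) {
    def >>[A](e: Extractor[A]): A = e.run(content)
  }

  // Stand-in for elementList: splits the document into words.
  val words: Extractor[List[String]] = Extractor(_.split(" ").toList)

  val doc = Doc("a bb ccc")

  // A method call binds tighter than an infix operator, so this parses as
  // doc >> (words.map(...)) and the function receives the whole List[String].
  val onExtractor: Int = doc >> words.map(ws => ws.length)

  // Parenthesised: extract first, then map over each element of the list.
  val onList: List[Int] = (doc >> words).map(w => w.length)
}
```

With this input, onExtractor is 3 (the length of the list) while onList is List(1, 2, 3) (the length of each word). The same lambda therefore sees a different argument type depending on where the parentheses are, which matches the difference described above.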