Question

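I have a fixed-length file (a sample is shown below), and I want to read it in Spark using the DataFrames API with Scala (not Python or Java). The DataFrames API has methods for reading text files, JSON files and so on, but I am not sure whether there is a way to read a fixed-length file. I searched the internet and found a GitHub link, but it requires downloading spark-fixedwidth-assembly-1.0.jar, and I could not find that jar anywhere. I am completely lost here and need your suggestions and help. There are a few posts on Stack Overflow, but they are not related to Scala and the DataFrame API.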

Here is the file:

56 apple     TRUE 0.56
45 pear      FALSE1.34
34 raspberry TRUE 2.43
34 plum      TRUE 1.31
53 cherry    TRUE 1.4 
23 orange    FALSE2.34
56 persimmon FALSE23.2

The fixed widths of the columns are 3, 10, 5 and 4.

Please share your suggestions.

Answer 1

Well... use substring to break up the lines, then trim to remove the whitespace, and then do whatever you want.

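// One record of the file; the column widths are 3, 10, 5 and 4.
case class DataUnit(s1: Int, s2: String, s3: Boolean, s4: Double)

// spark-shell already imports the implicits needed by .toDF;
// in a standalone application, add `import spark.implicits._` first.
sc.textFile("your_file_path")
  .map(l => (l.substring(0, 3).trim, l.substring(3, 13).trim, l.substring(13, 18).trim, l.substring(18, 22).trim))
  .map { case (e1, e2, e3, e4) => DataUnit(e1.toInt, e2, e3.toBoolean, e4.toDouble) }
  .toDF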
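
The hard-coded offsets above (0, 3, 13, 18, 22) are just the running sums of the column widths. As a rough sketch, they could be derived from the widths instead. Note that `splitFixed` is a hypothetical helper name used here for illustration, not part of Spark, and that the clamping to the line length is an added assumption to cope with lines whose trailing spaces have been stripped.

```scala
// Hypothetical helper (not part of Spark): slice one line into trimmed
// fields, given the width of each column.
def splitFixed(line: String, widths: Seq[Int]): Seq[String] = {
  // Running start offsets: widths 3, 10, 5, 4 give 0, 3, 13, 18, 22.
  val offsets = widths.scanLeft(0)(_ + _)
  offsets.zip(offsets.tail).map { case (start, end) =>
    // Clamp both ends to the line length so that a short line
    // yields a shorter (possibly empty) field instead of throwing.
    val s = math.min(start, line.length)
    val e = math.min(end, line.length)
    line.substring(s, e).trim
  }
}

val widths = Seq(3, 10, 5, 4)
val fields = splitFixed("45 pear      FALSE1.34", widths)
```

Inside the first `.map` above, `splitFixed(l, widths)` could then stand in for the four `substring` calls, with the resulting fields converted to `DataUnit` in the same way.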

Answer 2

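The fixed-length format is very old, and I could not find a good Scala library for it... so I created my own.

You can check it out here: https://github.com/atais/Fixed-Length

Using it with Spark is very simple, and you get a Dataset of your objects!

First you need to create a description of your object, e.g.: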

case class Employee(name: String, number: Option[Int], manager: Boolean)

object Employee {

    import com.github.atais.util.Read._
    import cats.implicits._
    import com.github.atais.util.Write._
    import Codec._

    implicit val employeeCodec: Codec[Employee] = {
      fixed[String](0, 10) <<:
        fixed[Option[Int]](10, 13, Alignment.Right) <<:
        fixed[Boolean](13, 18)
    }.as[Employee]
}

Afterwards, just use the parser:

val input = sql.sparkContext.textFile(file)
               .filter(_.trim.nonEmpty)
               .map(Parser.decode[Employee])
               .flatMap {
                  case Right(x) => Some(x)
                  case Left(e) =>
                         System.err.println(s"Failed to process file $file, error: $e")
                         None
               }
sql.createDataset(input)

How to read a fixed-length file in Spark using the DataFrame API and Scala
