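Representing a nested structure in Scala

Time: 2014-12-12 01:07:06

Tags: scala apache-spark

I have a sparse table in which some rows contain nested sub-tables, as shown below. How can I represent this structure with Scala collections?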

| rowkey  | orderid   | name          | amount    | supplier           | account     |
| rowkey1 | id0: 1001 | id1: "apple"  | id1: 1000 | id3: "fruits, inc" |             |
|         |           | id2: "apple2" | id2: 1200 |                    |             |
| rowkey2 | id4: 1002 | id5: "orange" | id5: 5000 |                    |             |
| rowkey3 | id6: 1003 | id7: "pear"   | id7: 500  |                    | id10: 77777 |
|         |           | id8: "pear2"  | id8: 350  |                    |             |
|         |           | id9: "pear3"  | id9: 500  |                    |             |

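Note: id1, id2, id3, ... are unique identifiers for each "attribute group"; each one is essentially the group id of a sub-row. For example, in the first row, id2: "apple2" and id2: 1200 belong to the same group id2, a sub-row with two attributes (name and amount) under rowkey1.

Another way to look at these 3 rows: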

    rowkey1, (orderid, id0, 1001), (name, id1, "apple"), (amount, id1, 1000), (name, id2, "apple2"), (amount, id2,1200), (supplier, id3, "fruit inc.")
    rowkey2, (orderid, id4, 1002), (name, id5, "orange"), (amount, id5,5000)
    rowkey3, (orderid, id6, 1003), (name, id7, "pear"), (amount, id7,500),(name, id8, "pear2"), (amount, id8,350),(name, id9, "pear3"), (amount, id9, 250), (account, id10, 777777)
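The flattened view above can be modelled fairly directly. The following is only a minimal sketch: the names `Cell` and `subRows` are hypothetical and not part of the question. Each triple becomes one `Cell`, and grouping by the group id rebuilds the sub-rows.

```scala
object CellDemo {
  // One (column, groupId, value) triple from the flattened view.
  // Cell is an illustrative name, not something from the question.
  final case class Cell(column: String, groupId: String, value: Any)

  // rowkey1 from the question, written as a flat list of cells.
  val rowkey1: List[Cell] = List(
    Cell("orderid", "id0", 1001),
    Cell("name", "id1", "apple"),
    Cell("amount", "id1", 1000),
    Cell("name", "id2", "apple2"),
    Cell("amount", "id2", 1200),
    Cell("supplier", "id3", "fruits, inc")
  )

  // Rebuild the sub-rows: group id -> (column name -> value).
  def subRows(cells: List[Cell]): Map[String, Map[String, Any]] =
    cells.groupBy(_.groupId).map { case (gid, cs) =>
      gid -> cs.map(c => c.column -> c.value).toMap
    }
}
```

With this layout, `CellDemo.subRows(CellDemo.rowkey1)("id2")` yields the sub-row holding the name "apple2" and the amount 1200.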

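Edit: note that the table has 2000 columns. Is it possible in Scala to create classes dynamically (or add attributes to a class), for example by loading the field names and types from an external file? I know case classes are limited to 22 fields.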
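One way to avoid generating classes at all is to keep the schema as data. The sketch below assumes a hypothetical schema format of one `columnName:type` entry per line; the object name `SchemaDemo`, the `parse` helper and the set of supported type names are illustrative assumptions, not something given in the question.

```scala
import scala.util.Try

object SchemaDemo {
  // Assumed schema format: one "columnName:type" entry per line.
  // In practice these lines could come from
  // scala.io.Source.fromFile(path).getLines().
  val schemaLines: List[String] =
    List("orderid:Int", "name:String", "amount:Int", "account:Long")

  // Column name -> declared type name; malformed lines are skipped.
  val schema: Map[String, String] =
    schemaLines
      .map(_.split(":", 2))
      .collect { case Array(name, tpe) => name.trim -> tpe.trim }
      .toMap

  // Convert a raw string using the declared type. Unknown columns,
  // unsupported type names and unparsable values all yield None.
  def parse(column: String, raw: String): Option[Any] =
    schema.get(column).flatMap {
      case "Int"    => Try(raw.toInt).toOption
      case "Long"   => Try(raw.toLong).toOption
      case "String" => Some(raw)
      case _        => None
    }
}
```

Because the columns live in a `Map` rather than in class fields, 2000 columns cost nothing extra in code; the trade-off is that values are typed as `Any` and have to be checked at runtime.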

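Edit 2: also note that any attribute except rowkey can have multiple lines, i.e. orderid, name, amount, supplier, account and the 1995+ other columns, so creating a separate "singleLine" class for all of those columns is not feasible. I am looking for the most generic solution.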

Thanks for the answers. To make it more generic, I could create these classes:

case class ColumnLine(
  id: Int,
  value: Option[Any]
)
case class Column(
  colname: String,
  coltype: String,
  lines: Option[List[ColumnLine]]
)
case class Row (
  rowkey:String,
  columns:Map[String,Column] //colname -> Column
)
case class Table (
  name:String,
  rows:Map[String,Row] //rowkey -> Row
)

Now I am trying to figure out how to query this structure, i.e. return the rows in which the column with colname == "amount" contains a value > 500.
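One possible way to express such a query against the Table/Row/Column/ColumnLine classes is sketched below. The case classes are repeated so the example is self-contained; `anyValue`, `rowsGreaterThan`, `intCol` and the sample data are hypothetical helpers introduced only for illustration.

```scala
object QueryDemo {
  // Same shape as the case classes in the question.
  final case class ColumnLine(id: Int, value: Option[Any])
  final case class Column(colname: String, coltype: String, lines: Option[List[ColumnLine]])
  final case class Row(rowkey: String, columns: Map[String, Column])
  final case class Table(name: String, rows: Map[String, Row])

  // True if any line of `colname` holds a value satisfying `p`.
  // A missing column or missing lines simply gives false.
  def anyValue(row: Row, colname: String)(p: Any => Boolean): Boolean =
    row.columns.get(colname).exists { col =>
      col.lines.getOrElse(Nil).exists(_.value.exists(p))
    }

  // Rows whose `colname` column has at least one Int above `threshold`.
  def rowsGreaterThan(table: Table, colname: String, threshold: Int): Iterable[Row] =
    table.rows.values.filter { row =>
      anyValue(row, colname) {
        case i: Int => i > threshold
        case _      => false
      }
    }

  // Small builder for an Int column from (lineId, value) pairs.
  def intCol(name: String, values: (Int, Int)*): (String, Column) =
    name -> Column(name, "Int",
      Some(values.toList.map { case (id, v) => ColumnLine(id, Some(v)) }))

  // Sample data loosely based on the table in the question.
  val table: Table = Table("orders", Map(
    "rowkey1" -> Row("rowkey1", Map(intCol("amount", 1 -> 1000, 2 -> 1200))),
    "rowkey3" -> Row("rowkey3", Map(intCol("amount", 7 -> 500, 8 -> 350)))
  ))
}
```

Here `QueryDemo.rowsGreaterThan(QueryDemo.table, "amount", 500)` keeps only rowkey1, since neither 500 nor 350 is strictly greater than 500.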

Edit 3: OK, this is the "quick and dirty" way, but it seems to work. On my laptop it scans 10M records in about 15 seconds.

import scala.util.control.Breaks._

object hello{

def main(args: Array[String]): Unit = {
    val n = 10000000
    def uuid = java.util.UUID.randomUUID.toString
    val row: Row = new Row(uuid, List(
                Column("orderid", "String", List(Single("id2",Some(uuid)))),
                Column("name", "String", List(Single("id2",Some("apple")),Single("id3",Some("apple2")))),
                Column("amount", "Int", List(Single("id2",Some(1000)),Single("id3",Some(1200)))),
                Column("supplier", "String", List(Single("id4",Some("fruits.inc")))),
                Column("account", "Int", List(Single("id10",Some(7777))))
                           )
            )
    println(new java.util.Date)
    val table: List[Row]= List.fill(n)(row)
    table.par.filter(row=> gt(row, "amount",500))
    .filter(row=> eq(row, "supplier","fruits.inc"))
    .filter(row=> eq(row, "account", 7777))
    //.foreach(println)
    println(new java.util.Date)

}

// True if any line of the named column holds exactly `colvalue`.
def eq (row:Row, colname: String, colvalue:Any): Boolean =
    getCol(row, colname).lines.exists(_.value.exists(_ == colvalue))

// True if any line of the named column holds an Int greater than `colvalue`.
// Empty or non-Int values are skipped instead of failing on the cast.
def gt (row:Row, colname: String, colvalue:Int): Boolean =
    getCol(row, colname).lines.exists(_.value.exists {
        case i: Int => i > colvalue
        case _      => false
    })

def getCol(row: Row, colname: String) : Column =
  row.columns.filter(_.colname==colname).head

case class Single(id: String, value: Option[Any])

case class Column(
  colname: String,
  coltype: String,
  lines: List[Single]
)

case class Row(
   rowkey: String,
   columns: List[Column]
)

}

2 Answers:

Answer 0: (score: 1)

There are many ways. For example, you could define the following:

case class OrderLine(
  name:String,
  amount:Int,
  supplier:Option[String],
  account:Option[String]
)

case class Order(
  rowkey:String,
  orderid:String,
  orders:Seq[OrderLine]
)

Then (this is just to build the example above; reading the 2000 rows from a file will of course look different, but you get the idea):

   val myOrders: Seq[Order] =
     Seq(
       Order("rowkey1", "1001", Seq(
         OrderLine("apple", 1000, Some("fruits, inc"), None),
         OrderLine("apple2", 1200, None, None)
       )),
       Order("rowkey2", "1002", Seq(
         OrderLine("orange", 5000, None, None)
       )),
       Order("rowkey3", "1003", Seq(
         OrderLine("pear", 500, None, Some("77777")),
         OrderLine("pear2", 350, None, None),
         OrderLine("pear3", 500, None, None)
       ))
     )

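The code for loading the data from an external file depends on the structure of that file. Basically, I would write one function that reads an OrderLine from the file and one function that reads an Order (which in turn uses the OrderLine function). These would be the basic building blocks for assembling the 2000 rows into an in-memory data structure.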
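As a rough illustration of those two building blocks, here is a sketch under an invented text format, since the real file layout is unknown: order parts are separated by `;`, order lines by `|`, and the fields of a line by `:`. The format, the object name `ReadDemo` and the function names `readOrderLine` and `readOrder` are all assumptions.

```scala
object ReadDemo {
  final case class OrderLine(name: String, amount: Int,
                             supplier: Option[String], account: Option[String])
  final case class Order(rowkey: String, orderid: String, orders: Seq[OrderLine])

  // An empty field becomes None.
  private def opt(s: String): Option[String] = Option(s.trim).filter(_.nonEmpty)

  // Assumed format of one order line: name:amount:supplier:account
  // The limit -1 keeps trailing empty fields.
  def readOrderLine(text: String): OrderLine =
    text.split(":", -1) match {
      case Array(name, amount, supplier, account) =>
        OrderLine(name.trim, amount.trim.toInt, opt(supplier), opt(account))
      case _ =>
        throw new IllegalArgumentException(s"Malformed order line: $text")
    }

  // Assumed format of one order: rowkey;orderid;line|line|...
  def readOrder(text: String): Order =
    text.split(";", 3) match {
      case Array(rowkey, orderid, lines) =>
        Order(rowkey.trim, orderid.trim, lines.split('|').toSeq.map(readOrderLine))
      case _ =>
        throw new IllegalArgumentException(s"Malformed order: $text")
    }
}
```

For example, `ReadDemo.readOrder("rowkey1;1001;apple:1000:fruits, inc:|apple2:1200::")` would produce the first order of the example, and a whole file could then be read by mapping `readOrder` over its lines.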

Answer 1: (score: 1)

The most natural way to represent this in Scala, assuming the column structure can be treated as fixed, would be something like

case class Single(name: String, amount: Int)

case class SingleEntry(
  orderid: Int,
  name: String,
  amount: Int,
  supplier: Option[String],
  account: Option[Long]
)

case class Entry(
  orderid: Int,
  items: List[Single],
  supplier: Option[String],
  account: Option[Long]
) {
  def singly(p: Single => Boolean): List[SingleEntry] =
    items.filter(p).map{ case Single(name, amount) =>
      SingleEntry(orderid, name, amount, supplier, account)
    }
}

Then, to pull out the items you want, you would write

table.
  filter(_.supplier.exists(_ == "fruits.inc")).
  flatMap(_.singly(_.amount > 500))

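But there are many ways to represent this kind of data structure, including maps (nested or otherwise); I would not take any particular answer as canonical.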
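For comparison, a nested-map version might look like the sketch below. The layout (rowkey -> group id -> column name -> value) and the names `NestedMapDemo` and `rowKeysWithAmountOver` are just one hypothetical choice among many.

```scala
object NestedMapDemo {
  // Hypothetical layout: rowkey -> group id -> column name -> value.
  type SubRow = Map[String, Any]
  type Table  = Map[String, Map[String, SubRow]]

  // The first two rows of the question's table.
  val table: Table = Map(
    "rowkey1" -> Map(
      "id0" -> Map("orderid" -> 1001),
      "id1" -> Map("name" -> "apple", "amount" -> 1000),
      "id2" -> Map("name" -> "apple2", "amount" -> 1200),
      "id3" -> Map("supplier" -> "fruits, inc")
    ),
    "rowkey2" -> Map(
      "id4" -> Map("orderid" -> 1002),
      "id5" -> Map("name" -> "orange", "amount" -> 5000)
    )
  )

  // True if the sub-row has an Int "amount" greater than `min`.
  private def hasAmountOver(sub: SubRow, min: Int): Boolean =
    sub.get("amount").exists {
      case i: Int => i > min
      case _      => false
    }

  // Row keys with at least one sub-row whose amount exceeds `min`.
  def rowKeysWithAmountOver(t: Table, min: Int): Set[String] =
    t.collect {
      case (rowkey, groups) if groups.values.exists(hasAmountOver(_, min)) => rowkey
    }.toSet
}
```

This keeps name and amount of the same group together without any fixed class per column, at the cost of runtime type checks on every value.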