How to filter a dataset

Time: 2017-01-24 23:41:03

Tags: scala apache-spark spark-streaming

I have the following data:

List(Map(event_id -> ABC, event_name -> visited, timestamp -> 1478187513, member_id -> 111, category -> web, field1 -> abc), 
     Map(event_id -> DEF, event_name -> added, timestamp -> 1478187520, member_id -> 111),
     Map(event_id -> ABC, event_name -> visited, timestamp -> 1478187522, member_id -> 111, category -> web, field1 -> abc),
     Map(event_id -> ABC, event_name -> visited, timestamp -> 1478187618, member_id -> 111, category -> web, field1 -> abc))
List(Map(event_id -> ABC, event_name -> visited, timestamp -> 1478187618, member_id -> 222, category -> web, field1 -> def))
List(Map(event_id -> ABC, event_name -> visited, timestamp -> 1478187513, member_id -> 333, category -> web, field1 -> abc), 
     Map(event_id -> DEF, event_name -> added, timestamp -> 1478187520, member_id -> 333),
     Map(event_id -> ABC, event_name -> visited, timestamp -> 1478187522, member_id -> 333, category -> web, field1 -> def),
     Map(event_id -> ABC, event_name -> visited, timestamp -> 1478187618, member_id -> 333, category -> web, field1 -> abc))

How can I remove every List[Map[..]] in which at least one Map has field1 equal to def?

The result should be this:

List(Map(event_id -> ABC, event_name -> visited, timestamp -> 1478187513, member_id -> 111, category -> web, field1 -> abc), 
     Map(event_id -> DEF, event_name -> added, timestamp -> 1478187520, member_id -> 111),
     Map(event_id -> ABC, event_name -> visited, timestamp -> 1478187522, member_id -> 111, category -> web, field1 -> abc),
     Map(event_id -> ABC, event_name -> visited, timestamp -> 1478187618, member_id -> 111, category -> web, field1 -> abc))

This is my draft code, but I cannot get it to compile:

            val result = dataset.filter({
              list => !list.exists(t => t.getOrElse("field1","").equals("def"))
            })

2 answers:

Answer 0: (score: 1)

scala> val data = List(List(Map("event_id" -> "ABC", "event_name" -> "visited", "timestamp" -> "1478187513", "member_id" -> "111", "category" -> "web", "field1" -> "abc"),
     |       Map("event_id" -> "DEF", "event_name" -> "added", "timestamp" -> "1478187520", "member_id" -> "111"),
     |       Map("event_id" -> "ABC", "event_name" -> "visited", "timestamp" -> "1478187522", "member_id" -> "111", "category" -> "web", "field1" -> "abc"),
     |       Map("event_id" -> "ABC", "event_name" -> "visited", "timestamp" -> "1478187618", "member_id" -> "111", "category" -> "web", "field1" -> "abc")),
     |       List(Map("event_id" -> "ABC", "event_name" -> "visited", "timestamp" -> "1478187618", "member_id" -> "222", "category" -> "web", "field1" -> "def")),
     |       List(Map("event_id" -> "ABC", "event_name" -> "visited", "timestamp" -> "1478187513", "member_id" -> "333", "category" -> "web", "field1" -> "abc"),
     |         Map("event_id" -> "DEF", "event_name" -> "added", "timestamp" -> "1478187520", "member_id" -> "333"),
     |         Map("event_id" -> "ABC", "event_name" -> "visited", "timestamp" -> "1478187522", "member_id" -> "333", "category" -> "web", "field1" -> "def"),
     |         Map("event_id" -> "ABC", "event_name" -> "visited", "timestamp" -> "1478187618", "member_id" -> "333", "category" -> "web", "field1" -> "abc")))


scala> def filterData(xs: List[List[Map[String, String]]]): List[List[Map[String, String]]] = {
     | xs.filter(sumList => !sumList.exists(x => x.getOrElse("field1", "").equals("def")))
     | }
filterData: (xs: List[List[Map[String,String]]])List[List[Map[String,String]]]


scala> val output = filterData(data)
output: List[List[Map[String,String]]] = List(List(Map(timestamp -> 1478187513, field1 -> abc, event_name -> visited, category -> web, member_id -> 111, event_id -> ABC), Map(event_id -> DEF, event_name -> added, timestamp -> 1478187520, member_id -> 111), Map(timestamp -> 1478187522, field1 -> abc, event_name -> visited, category -> web, member_id -> 111, event_id -> ABC), Map(timestamp -> 1478187618, field1 -> abc, event_name -> visited, category -> web, member_id -> 111, event_id -> ABC)))

Answer 1: (score: 1)

Assuming your keys are Strings and you concatenate those Lists into one big List, i.e. a List[List[Map[String, Any]]] named data, you can do this:
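The snippet for this step is not preserved in this copy of the answer. What follows is only a minimal sketch of one way to do it under the same assumptions (an outer list named data, String keys, Any values). The sample values are a shortened, hypothetical version of the question's data, and filterNot with Option#contains is one possible choice, not necessarily what the answer used.

```scala
// Hypothetical sample shaped like the question's data; values are typed Any,
// as the answer assumes.
val data: List[List[Map[String, Any]]] = List(
  List(
    Map("event_id" -> "ABC", "member_id" -> "111", "field1" -> "abc"),
    Map("event_id" -> "DEF", "member_id" -> "111")
  ),
  List(
    Map("event_id" -> "ABC", "member_id" -> "222", "field1" -> "def")
  ),
  List(
    Map("event_id" -> "ABC", "member_id" -> "333", "field1" -> "abc"),
    Map("event_id" -> "ABC", "member_id" -> "333", "field1" -> "def")
  )
)

// Keep an inner list only if none of its maps has field1 == "def".
// Map#get returns an Option and Option#contains compares the wrapped value,
// so maps without a field1 key never match.
val result: List[List[Map[String, Any]]] =
  data.filterNot(events => events.exists(_.get("field1").contains("def")))
```

With this sample, only the inner list for member 111 should remain, because the other two each contain at least one map whose field1 is def.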

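By the way, this data structure is quite complex, which makes it harder to reason about and therefore harder to see how to process it. Its complexity will also make a partitioning strategy more difficult, which can lead to poor performance. I suggest you consider ways to simplify your data model.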
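As an illustration of what a simpler model might look like, here is a rough sketch that replaces the nested maps with a flat list of typed records. The Event case class, its field names, and the sample rows are all hypothetical. It also assumes that each inner list in the question groups the events of a single member_id, which the sample data suggests but the question does not state.

```scala
// Hypothetical flat record type; field names mirror the keys in the question.
// Optional keys (category, field1) become Option so missing values are explicit.
final case class Event(
  eventId: String,
  eventName: String,
  timestamp: Long,
  memberId: String,
  category: Option[String] = None,
  field1: Option[String] = None
)

// Shortened, made-up sample based on the question's data.
val events: List[Event] = List(
  Event("ABC", "visited", 1478187513L, "111", Some("web"), Some("abc")),
  Event("DEF", "added", 1478187520L, "111"),
  Event("ABC", "visited", 1478187618L, "222", Some("web"), Some("def")),
  Event("ABC", "visited", 1478187513L, "333", Some("web"), Some("abc")),
  Event("ABC", "visited", 1478187522L, "333", Some("web"), Some("def"))
)

// Members that have at least one event with field1 == "def".
val flagged: Set[String] =
  events.filter(_.field1.contains("def")).map(_.memberId).toSet

// Drop every event that belongs to a flagged member.
val kept: List[Event] = events.filterNot(e => flagged.contains(e.memberId))
```

With a flat record like this, memberId is an ordinary field, so the same logic could in principle be keyed or partitioned by member rather than by position in a nested list.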