Aggregate repeated records while maintaining order, and also include the repeated records

Date: 2018-02-14 08:01:18

Tags: scala list time-complexity aggregate-functions

I am trying to solve an interesting problem. It would be easy to just do a groupBy aggregation such as sum, count, etc., but this problem is slightly different. Let me explain:
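For contrast, here is a minimal sketch of the plain groupBy-and-sum mentioned above, over hypothetical (test_code, amount) pairs rather than the real schema. It collapses each key into a single total, losing both the insertion order and the per-occurrence running sums, which is why this problem needs a different approach:

```scala
// Hypothetical (test_code, amount) pairs; a straight groupBy + sum
// yields one total per key and forgets order and intermediate sums.
val pairs = List(("47542", 280), ("49337", 200), ("47542", 256))
val totals = pairs.groupBy { case (code, _) => code }
  .map { case (code, vs) => code -> vs.map(_._2).sum }
println(totals("47542")) // 536
```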

这是我的元组列表:

val repeatSmokers: List[(String, String, String, String, String, String)] =
  List(
    ("ID76182", "sachin", "kita MR.", "56308", "1990", "300"),
    ("ID76182", "KOUN", "Jana MR.", "56714", "1990", "100"),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", "255"),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", "110"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "20"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "6750"),
    ("ID76182", "DOWNES", "RYAN", "47542", "1990", "2090"),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", "200"),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", "280"),
    ("ID76182", "JAMES", "JIM", "30548", "1990", "300"),
    ("ID76182", "KIMMELSHUE", "RUTH", "55345", "1990", "2600"),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", "370"),
    ("ID76182", "COOPER", "ANADA", "45873", "1990", "2600"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "2600"),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", "256")
  )

The schema of these records is (IDNumber, firstName, lastName, test_code, year, amount). Out of these records, I want only the repeated ones. The way we define a unique combination in the above list is by taking the name and test_code combination, e.g. (sachin, kita MR., 56308). That means if the same name and test_code repeat, it is a repeat-smoker record. For simplicity, you can assume that test_code alone is the unique value: if it repeats, you can say it is a repeat-smoker record.

Below is the exact expected output:

ID76182,27539,1990,255,1 
ID76182,27539,1990,365,2
ID76182,45873,1990,20,1 
ID76182,45873,1990,6770,2 
ID76182,45873,1990,9370,3
ID76182,49337,1990,200,1
ID76182,49337,1990,570,2
ID76182,47542,1990,280,1
ID76182,47542,1990,536,2

Finally, the challenging part here is maintaining the order of every repeat-smoker record while summing the amounts and adding the occurrence count.

For example, take this record: ID76182,47542,1990,536,2. Its schema is:

IDNumber, test_code, year, amount, OCCURRENCES

Since it occurred twice, we see 2 above.
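The running-sum-plus-occurrence pattern can be sketched in isolation for a single group's amounts (here the two amounts of the 47542 group, 280 and 256). This is just an illustration of the target shape, not the full solution:

```scala
val amounts = List(280, 256) // amounts of one repeated group, in order
// scanLeft builds cumulative sums; tail drops the initial 0 seed
val running = amounts.scanLeft(0)(_ + _).tail // 280, 536
// pair each cumulative sum with its 1-based occurrence counter
val withOccurrences = running.zipWithIndex.map { case (sum, i) => (sum, i + 1) }
println(withOccurrences) // List((280,1), (536,2))
```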

  

Note:

The output can be a List or any other collection, but it should be in the same format I mentioned above.

3 answers:

Answer 0 (score: 2):

So here is some code in Scala, though it is really Java code written with Scala syntax:

import java.util.ArrayList
import java.util.LinkedHashMap
import scala.collection.JavaConverters._


type RawRecord = (String, String, String, String, String, String)
type Record = (String, String, String, String, Int, Int)
type RecordKey = (String, String, String, String)
type Output = (String, String, String, String, Int, Int, Int)
val keyF: Record => RecordKey = r => (r._1, r._2, r._3, r._4)
val repeatSmokersRaw: List[RawRecord] =
  List(
    ("ID76182", "sachin", "kita MR.", "56308", "1990", "300"),
    ("ID76182", "KOUN", "Jana MR.", "56714", "1990", "100"),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", "255"),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", "110"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "20"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "6750"),
    ("ID76182", "DOWNES", "RYAN", "47542", "1990", "2090"),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", "200"),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", "280"),
    ("ID76182", "JAMES", "JIM", "30548", "1990", "300"),
    ("ID76182", "KIMMELSHUE", "RUTH", "55345", "1990", "2600"),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", "370"),
    ("ID76182", "COOPER", "ANADA", "45873", "1990", "2600"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "2600"),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", "256")
  )
val repeatSmokers = repeatSmokersRaw.map(r => (r._1, r._2, r._3, r._4, r._5.toInt, r._6.toInt))

val acc = new LinkedHashMap[RecordKey, (ArrayList[Output], Int, Int)]
repeatSmokers.foreach(r => {
  val key = keyF(r)
  var cur = acc.get(key)
  if (cur == null) {
    cur = (new ArrayList[Output](), 0, 0)
  }
  val nextCnt = cur._2 + 1
  val sum = cur._3 + r._6
  val output = (r._1, r._2, r._3, r._4, r._5, sum, nextCnt)
  cur._1.add(output)
  acc.put(key, (cur._1, nextCnt, sum))
})
val result = acc.values().asScala.filter(p => p._2 > 1).flatMap(p => p._1.asScala)
// or if you are clever you can merge filter and flatMap as
// val result = acc.values().asScala.flatMap(p => if (p._1.size > 1) p._1.asScala else Nil)

println(result.mkString("\n"))

This prints:

  

(ID76182,GANGS,SKILL,27539,1990,255,1)
(ID76182,GANGS,SKILL,27539,1990,365,2)
(ID76182,SEMI,GAUTAM A MR.,45873,1990,20,1)
(ID76182,SEMI,GAUTAM A MR.,45873,1990,6770,2)
(ID76182,SEMI,GAUTAM A MR.,45873,1990,9370,3)
(ID76182,DRAGON,WARS,49337,1990,200,1)
(ID76182,DRAGON,WARS,49337,1990,570,2)
(ID76182,HULK,PAIN MR.,47542,1990,280,1)
(ID76182,HULK,PAIN MR.,47542,1990,536,2)

The main trick in this code is using Java's LinkedHashMap as the accumulator collection, because it preserves insertion order. A further trick is to store a list inside it (since I am using Java collections I settled on ArrayList as the inner accumulator, but you can use whatever you like). So the idea is to build a map of key => list of smokers, and for each key also store the current counter and the current sum, so that an "aggregated" smoker can be appended to the list. Once the map is built, iterate over it to filter out those keys that have not accumulated at least 2 records, and then flatten the map of lists into a single list (this is the point where LinkedHashMap matters: insertion order is preserved during that iteration).
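The insertion-order property this answer relies on can be seen with Scala's own scala.collection.mutable.LinkedHashMap as well (a minimal sketch; the answer itself uses java.util.LinkedHashMap, which behaves the same way for iteration order):

```scala
import scala.collection.mutable.LinkedHashMap

// Keys come back in insertion order, unlike a plain HashMap
val m = LinkedHashMap[String, Int]()
m("49337") = 200
m("27539") = 255
m("47542") = 280
println(m.keys.mkString(",")) // 49337,27539,47542
```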

Answer 1 (score: 2):

Here is a functional way of solving this problem:

Using a case class to represent a record:

case class Record(
    id: String,
    fname: String,
    lname: String,
    code: String,
    year: String,
    amount: String)

we can perform the following:

val result = repeatSmokers
  .map(recordTuple => Record.tupled(recordTuple))
  .zipWithIndex
  .groupBy { case (record, order) => (record.fname, record.lname, record.code) }
  .flatMap {

    case (_, List(singleRecord)) => Nil // get rid of non-repeat records

    case (key, records) => {

      val firstKeyIdx = records.head._2

      val amounts = records.map {
        case (record, order) => record.amount.toInt
      }.foldLeft(List[Int]()) {
        case (Nil, addAmount) => List(addAmount)
        case (previousAmounts :+ lastAmount, addAmount) =>
          previousAmounts :+ lastAmount :+ (lastAmount + addAmount)
      }

      records
        .zip(amounts)
        .zipWithIndex
        .map {
          case (((rec, order), amount), idx) =>
            val serializedRecord =
              List(rec.id, rec.code, rec.year, amount, idx + 1)
            (serializedRecord.mkString(","), (firstKeyIdx, idx))
        }
    }
  }
  .toList
  .sortBy { case (serializedRecord, finalOrder) => finalOrder }
  .map { case (serializedRecord, finalOrder) => serializedRecord }

which produces:

ID76182,27539,1990,255,1
ID76182,27539,1990,365,2
ID76182,45873,1990,20,1
ID76182,45873,1990,6770,2
ID76182,45873,1990,9370,3
ID76182,49337,1990,200,1
ID76182,49337,1990,570,2
ID76182,47542,1990,280,1
ID76182,47542,1990,536,2

Some explanations:

.map(recordTuple => Record.tupled(recordTuple))

A very nice way to instantiate a case class from a tuple (this creates the list of records from the list of tuples).

.zipWithIndex

Each record is tupled with its global index, as (record, index), so that we can deal with ordering later on.

.groupBy { case (record, order) => (record.fname, record.lname, record.code) }

Then we group by the key you need.

case (_, List(singleRecord)) => Nil

Then, for each key/value resulting from the grouping phase, we output a list of records (or an empty list if the value is a single record); the flatMap flattens the lists that are produced. This is the part that gets rid of non-repeated records.

val amounts = records.map {
    case (record, order) => record.amount.toInt
  }.foldLeft(List[Int]()) {
    case (Nil, addAmount) => List(addAmount)
    case (previousAmounts :+ lastAmount, addAmount) =>
      previousAmounts :+ lastAmount :+ (lastAmount + addAmount)
  }

The other case deals with building the cumulated amounts (amounts is a List[Int]). (A note for Spark developers: Scala's groupBy does preserve the order of value elements within a given key.)
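To see what this fold does, here is the same foldLeft applied on its own to the amounts of the 45873 group (20, 6750, 2600) from the input above:

```scala
// Same foldLeft shape as in the answer: each step appends the running total
val amounts = List(20, 6750, 2600)
val cumulated = amounts.foldLeft(List[Int]()) {
  case (Nil, add)                  => List(add)
  case (previous :+ last, add)     => previous :+ last :+ (last + add)
}
println(cumulated) // List(20, 6770, 9370)
```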

records
    .zip(amounts)
    .zipWithIndex
    .map {
      case (((rec, order), amount), idx) =>
        val serializedRecord =
          List(rec.id, rec.code, rec.year, amount, idx + 1).mkString(
            ",")
        (serializedRecord, (firstKeyIdx, idx))
    }

These amounts are zipped back with the records so that each record's amount is replaced with the corresponding cumulated amount. This also serializes the records into the final desired format.

.sortBy { case (serializedRecord, finalOrder) => finalOrder }

The previous part also zipped each record with its index. Indeed, each serialized record comes with a tuple (firstKeyIdx, idx) that is used to sort the records as required: first by the order in which each key first appears (firstKeyIdx), and then records coming from the same key are sorted by the "nested" order defined by idx.
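The two-level ordering can be checked in isolation: tuples compare lexicographically under sortBy, so (firstKeyIdx, idx) orders groups by first appearance and then records within a group. The labels below are hypothetical stand-ins for serialized records:

```scala
// (serializedRecord, (firstKeyIdx, idx)) stand-ins with made-up indices
val keyed = List(("B-1st", (5, 0)), ("A-2nd", (2, 1)), ("A-1st", (2, 0)))
val ordered = keyed.sortBy { case (_, finalOrder) => finalOrder }.map(_._1)
println(ordered.mkString(", ")) // A-1st, A-2nd, B-1st
```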


Answer 2 (score: 1):

Here is a functional/recursive way of solving this problem, based on @SergGr's solution, which rightly introduced LinkedHashMap.

Given this input:

val repeatSmokers: List[(String, String, String, String, String, Int)] =
  List(
    ("ID76182", "sachin", "kita MR.", "56308", "1990", 300),
    ("ID76182", "KOUN", "Jana MR.", "56714", "1990", 100),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", 255),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", 110),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", 20),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", 6750),
    ("ID76182", "DOWNES", "RYAN", "47542", "1990", 2090),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", 200),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", 280),
    ("ID76182", "JAMES", "JIM", "30548", "1990", 300),
    ("ID76182", "KIMMELSHUE", "RUTH", "55345", "1990", 2600),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", 370),
    ("ID76182", "COOPER", "ANADA", "45873", "1990", 2600),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", 2600),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", 256)
  )

we first prepare and then aggregate the data this way:

case class Record(
  id: String, fname: String, lname: String,
  code: String, year: String, var amount: Int
)

case class Key(fname: String, lname: String, code: String)

val preparedRecords: List[(Key, Record)] = repeatSmokers.map {
  case recordTuple @ (_, fname, lname, code, _, _) =>
    (Key(fname, lname, code), Record.tupled(recordTuple))
}

import scala.collection.mutable.LinkedHashMap

def aggregateDuplicatesWithOrder(
    remainingRecords: List[(Key, Record)],
    processedRecords: LinkedHashMap[Key, List[Record]]
): LinkedHashMap[Key, List[Record]] =
  remainingRecords match {

    case (key, record) :: newRemainingRecords => {

      processedRecords.get(key) match {
        case Some(recordList :+ lastRecord) => {
          record.amount = record.amount + lastRecord.amount
          processedRecords.update(key, recordList :+ lastRecord :+ record)
        }
        case None => processedRecords(key) = List(record)
      }

      aggregateDuplicatesWithOrder(newRemainingRecords, processedRecords)
    }

    case Nil => processedRecords
  }

val result = aggregateDuplicatesWithOrder(
  preparedRecords, LinkedHashMap[Key, List[Record]]()
).values.flatMap {
  case _ :: Nil => Nil
  case records =>
    records.zipWithIndex.map { case (rec, idx) =>
      List(rec.id, rec.code, rec.year, rec.amount, idx + 1).mkString(",")
    }
}
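For a quick check, the same LinkedHashMap idea can be condensed into a self-contained script over a subset of the input rows. This sketch uses plain tuples and mutable Scala collections instead of the case classes above, purely for illustration:

```scala
import scala.collection.mutable.{LinkedHashMap, ListBuffer}

// A subset of the input: one repeated group per key is enough to illustrate
val rows = List(
  ("ID76182", "GANGS", "SKILL", "27539", "1990", 255),
  ("ID76182", "GANGS", "SKILL", "27539", "1990", 110),
  ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", 20),
  ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", 6750),
  ("ID76182", "DOWNES", "RYAN", "47542", "1990", 2090)
)
// key -> serialized "id,code,year,cumulativeSum,occurrence" rows, first-seen order
val acc = LinkedHashMap[(String, String, String), ListBuffer[String]]()
val sums = scala.collection.mutable.Map[(String, String, String), Int]()
for ((id, fname, lname, code, year, amount) <- rows) {
  val key = (fname, lname, code)
  val sum = sums.getOrElse(key, 0) + amount
  sums(key) = sum
  val buf = acc.getOrElseUpdate(key, ListBuffer())
  buf += List(id, code, year, sum, buf.size + 1).mkString(",")
}
// Keep only keys seen more than once; LinkedHashMap preserves insertion order
val output = acc.values.filter(_.size > 1).flatten.toList
output.foreach(println)
```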