How can I count one list against another list in O(n * log(n))?

Time: 2015-10-09 06:46:43

Tags: algorithm haskell

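I am looking for a function that efficiently counts, for each element of one list, how many times it occurs in another list. It should return a sorted list of element/count tuples. Here is the specification: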

countList :: Ord a => [a] -> [a] -> [(a, Integer)]
countList ['a', 'b', 'c', 'b', 'b'] ['a', 'b', 'x']
                               == [('a', 1), ('b', 3), ('x', 0)]
length (countList xs ys) == length ys

A naive implementation would be:

-- requires: import Control.Arrow ((&&&)); import Data.List (sort)
countList xs = sort . map (id &&& length . (\ y -> filter (== y) xs))

This is O(n^2). However, since we have Ord a, a better strategy could make it faster. We could sort both lists first and then compare them in a "ladder-climbing" fashion.

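For example, here are the two lists after sorting. If I were doing this imperatively, I would use two pointers, each pointing to the first element of its list: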

       i
       |
xs = ['a', 'b', 'b', 'b', 'c']
ys = ['a', 'b', 'x']
       |
       j

Then increase i while xs !! i == ys !! j, adding one to the counter at position j each time. When i encounters a new element, find it in ys by increasing j, and then repeat the previous step. This algorithm is O(n*log(n)), since the cost is dominated by the two sorts.

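But I could not find a way to achieve the same complexity in a purely functional way, nor could I find any existing function that does what I want. How should I do this in Haskell?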

2 answers:

Answer 0: (score: 5)

If the second list has no duplicates and the first list is the longer one, you can avoid sorting the first list by using Data.Map. This achieves n1 log n2 complexity:

import Data.Map (fromList, toList, adjust)

countList :: Ord a => [a] -> [a] -> [(a, Int)]
countList l r = toList $ foldr (adjust (+1)) (fromList . zip r $ repeat 0) l
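
Because a Map stores each key only once, the version above returns a single entry for an element that appears several times in the second list, so length (countList xs ys) == length ys no longer holds in that case. If repeated elements should each get their own entry, one possible sketch is to build the counts as before and then look them up for every element of the sorted second list. The name countListDup is hypothetical, and the use of Data.Map.Strict with foldl' is an assumption made here to avoid building up thunks; neither comes from the answer itself.

```haskell
import qualified Data.Map.Strict as M
import Data.List (foldl', sort)

-- Hypothetical variant (not from the original answer): keeps one output
-- entry per element of ys, even when ys contains duplicates.
countListDup :: Ord a => [a] -> [a] -> [(a, Int)]
countListDup xs ys = [ (y, counts M.! y) | y <- sort ys ]
  where
    -- One zero-initialised counter per distinct element of ys.
    zeros  = M.fromList [ (y, 0) | y <- ys ]
    -- M.adjust leaves the map unchanged for absent keys, so elements
    -- of xs that never occur in ys are simply ignored.
    counts = foldl' (flip (M.adjust (+ 1))) zeros xs
```

The lookup with M.! is safe here because every element of ys was inserted as a key when zeros was built. The first list is still never sorted; only the second list is.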

Answer 1: (score: 3)

I think this does what you want:

import Data.List (sort)

countList :: Ord a => [a] -> [a] -> [(a, Int)]
countList l1 l2 = countList' (sort l1) (sort l2)
  where countList' _     []  = []
        countList' xs (y:ys) = let xs'   = dropWhile (<  y) xs
                                   (a,b) = span      (== y) xs'
                                in (y, length a) : countList' b ys

main = print $ countList ['a', 'b', 'c', 'b', 'b'] ['a', 'b', 'x']
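
Note that the walk above consumes the matching elements of the first list as it goes, so if the second list contains the same element twice, its second occurrence is reported with a count of 0 (for example, "abb" against "bb" gives [('b',2),('b',0)]). If duplicates should instead repeat the count, as the naive definition in the question does, a possible sketch is to group the sorted second list and emit one entry per member of each group. The name countListSorted is hypothetical and is not part of the answer.

```haskell
import Data.List (group, sort)

-- Hypothetical variant (not from the original answer): same sorted walk,
-- but each run of equal elements in the second list shares one count.
countListSorted :: Ord a => [a] -> [a] -> [(a, Int)]
countListSorted l1 l2 = go (sort l1) (group (sort l2))
  where
    go _  []                 = []
    go xs ([] : gs)          = go xs gs   -- group never yields [], kept for totality
    go xs (g@(y : _) : gs)   =
      let (hits, rest) = span (== y) (dropWhile (< y) xs)
          n            = length hits
      in  map (\v -> (v, n)) g ++ go rest gs
```

Each element of either sorted list is still visited a bounded number of times, so the two sorts remain the dominant cost.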