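I'm looking for a function that efficiently counts, for each element of one list, how many times it occurs in another list. It should return a sorted list of element/count tuples. Here is the specification: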
countList :: Ord a => [a] -> [a] -> [(a, Integer)]
countList ['a', 'b', 'c', 'b', 'b'] ['a', 'b', 'x']
== [('a', 1), ('b', 3), ('x', 0)]
length (countList xs ys) == length ys
A naive implementation would be:
-- uses (&&&) from Control.Arrow and sort from Data.List
countList xs = sort . map (id &&& length . (\ y -> filter (== y) xs))
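This is O(n^2). However, since we have Ord a, a better strategy can make it faster. We could sort both lists first and then compare them in a "ladder-climbing" fashion.
For example, here are the two lists, sorted. If I were doing this imperatively, I would use two pointers, each pointing at the first element of its list: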
       i
       |
xs = ['a', 'b', 'b', 'b', 'c']
ys = ['a', 'b', 'x']
       |
       j
Then increase i while xs !! i == ys !! j, adding one to the counter at position j each time. When i reaches a new element, find it in ys by increasing j, then repeat the previous step. This algorithm is O(n*log(n)).
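But I can't find a way to achieve the same complexity in a purely functional way, nor have I found any existing function that does what I want. How should I do this in Haskell?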
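Answer 0 (score: 5)
If the second list has no duplicates and the first list is the longer one, you can use Data.Map to avoid sorting the first list. This gives O(n1 log n2) complexity, where n1 and n2 are the lengths of the first and second lists: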
import Data.Map (fromList, toList, adjust)
countList :: Ord a => [a] -> [a] -> [(a, Int)]
countList l r = toList $ foldr (adjust (+1)) (fromList . zip r $ repeat 0) l
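The same idea can be sketched with a strict left fold. This is only an illustration under stated assumptions: it swaps in Data.Map.Strict and foldl' for the lazy foldr above so that the counts are evaluated as the fold proceeds, it returns Integer counts to match the signature in the question, and the name countListMap is hypothetical. As in the answer, duplicates in the second list collapse into a single key, so the result can be shorter than the second list.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- Hypothetical name: counts how often each element of ys occurs in xs.
countListMap :: Ord a => [a] -> [a] -> [(a, Integer)]
countListMap xs ys = M.toList (foldl' bump zeros xs)
  where
    -- Every element of ys starts at zero; duplicate keys collapse here.
    zeros = M.fromList [(y, 0) | y <- ys]
    -- M.adjust leaves the map unchanged when x is not one of the keys.
    bump m x = M.adjust (+ 1) x m
```

For the example in the question, countListMap "abcbb" "abx" evaluates to [('a',1),('b',3),('x',0)].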
Answer 1 (score: 3)
I think this does what you're after:
import Data.List (sort)
countList :: Ord a => [a] -> [a] -> [(a, Int)]
countList l1 l2 = countList' (sort l1) (sort l2)
  where countList' _ [] = []
        countList' xs (y:ys) = let xs' = dropWhile (< y) xs
                                   (a,b) = span (== y) xs'
                               in (y, length a) : countList' b ys
main = print $ countList ['a', 'b', 'c', 'b', 'b'] ['a', 'b', 'x']
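One case worth noting: if the second list contains duplicates, the code above reports the full count only for the first copy, because the matching run of the first list has already been consumed when the second copy is reached. A sketch of one way to make every copy get the same count, as the naive definition in the question does, might look like this. The name countListDup is hypothetical, and the choice of Integer counts via genericLength is an assumption made to match the question's signature.

```haskell
import Data.List (genericLength, sort)

-- Hypothetical variant: repeated elements of the second list all
-- receive the same count.
countListDup :: Ord a => [a] -> [a] -> [(a, Integer)]
countListDup l1 l2 = go (sort l1) (sort l2)
  where
    go _ [] = []
    go xs (y:ys) =
      let (dupYs, restYs) = span (== y) ys      -- further copies of y
          xs'             = dropWhile (< y) xs  -- skip elements below y
          (same, restXs)  = span (== y) xs'     -- the run equal to y
          n               = genericLength same
      in replicate (1 + length dupYs) (y, n) ++ go restXs restYs
```

After the two sorts, each element of either list is visited a constant number of times, so the overall cost is still dominated by sorting.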