Bigram frequencies over query-log events with Apache Spark

Date: 2015-02-06 22:44:57

Tags: scala session pattern-matching apache-spark sequence

I want to study user actions within sessions extracted from a search engine's query logs. For now I define two kinds of actions: queries and clicks.

sealed trait Action{}
case class Query(val input:String) extends Action
case class Click(val link:String)  extends Action

Assume the first action in the query log is given by the following timestamp, in milliseconds:

val t0 = 1417444964686L // 2014-12-01 15:42:44

Let us define a corpus of temporally ordered actions associated with session IDs.

val query_log:Array[(String, (Action, Long))] = Array (
("session1",(Query("query1"),t0)), 
("session1",(Click("link1") ,t0+1000)), 
("session1",(Click("link2") ,t0+2000)), 
("session1",(Query("query2"),t0+3000)), 
("session1",(Click("link3") ,t0+4000)), 
("session2",(Query("query3"),t0+5000)), 
("session2",(Click("link4") ,t0+6000)), 
("session2",(Query("query4"),t0+7000)), 
("session2",(Query("query5"),t0+8000)),
("session2",(Click("link5") ,t0+9000)),
("session2",(Click("link6") ,t0+10000)),
("session3",(Query("query6"),t0+11000))
)

We create an RDD for this query_log:

import org.apache.spark.rdd.RDD
var logs:RDD[(String, (Action, Long))] = sc.makeRDD(query_log)

Then we group the logs by session ID:

val sessions_groups:RDD[(String, Iterable[(Action, Long)])] = logs.groupByKey().cache()
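For intuition, groupByKey behaves like Scala's groupBy on a plain collection (Spark simply hands back a CompactBuffer as its Iterable implementation). A minimal, Spark-free sketch of the same grouping step:

```scala
// Spark-free analogue of logs.groupByKey(): group pairs by their key,
// keeping only the values, with per-group element order preserved.
val log = Seq(("session1", "query1"), ("session1", "link1"), ("session2", "query3"))
val grouped: Map[String, Seq[String]] =
  log.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
// grouped("session1") == Seq("query1", "link1")
```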

Now we want to study co-occurrences of actions within a session, for example the number of query rewritings per session. We therefore define a class Cooccurrences, to be initialized from a session's actions.

case class Cooccurrences(
  var numQueriesWithClicks:Int = 0,
  var numQueries:Int = 0,
  var numRewritings:Int = 0,
  var numQueriesBeforeClicks:Int = 0
 ) {
 // The co-occurrence object is initialized from a list of timestamped actions, i.e. the actions of one session group
  def initFromActions(actions:Iterable[(Action, Long)]) = {
    // 30 seconds is the maximal time (in milliseconds) between two queries (q1, q2) to consider q2 a rewriting of q1
    val thirtySeconds = 30000
    var hasClicked = false 
    var hasRewritten = false
    // in the observed action sequence, we extract consecutive (sliding(2)) actions sorted by timestamp
    // for each bigram in the sequence we want to count and modify the cooccurrence object
    actions.toSeq.sortBy(_._2).sliding(2).foreach{ 
      // case Seq(l0) => // session with only one Action 
      case Seq((e1:Click, t0)) => { // click without any query
        numQueries = 0        
      }
      case Seq((e1:Query, t0)) => { // query without any click
        numQueries = 1        
        numQueriesBeforeClicks = 1
      }
      // case Seq(l0, l1) => // session with at least two Actions
      case Seq((e1:Click, t0), (e2:Query, t1)) => { // a click followed by a query
        if(! hasClicked)
          numQueriesBeforeClicks = numQueries
        hasClicked = true
        }
      case Seq((e1:Click, t0), (e2:Click, t1)) => { // two consecutive clicks
        if(! hasClicked)
          numQueriesBeforeClicks = numQueries
        hasClicked = true
      }
      case Seq((e1:Query, t0), (e2:Click, t1)) => { // a query followed by a click
        numQueries += 1
        if(! hasClicked)
          numQueriesBeforeClicks = numQueries
        hasClicked = true
        numQueriesWithClicks +=1
      }
      case Seq((e1:Query, t0), (e2:Query, t1)) => { // two consecutive queries
        val dt = t1 - t0
        numQueries += 1
        if(dt < thirtySeconds && e1.input != e2.input){
          hasRewritten = true
          numRewritings += 1
       }
      }
    }
  }

}
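As an aside on the match cases above: sliding(2) yields 2-element windows for sequences of length ≥ 2, but a single 1-element window for a 1-element sequence, which is why the single-action cases are needed. A quick plain-Scala check:

```scala
// sliding(2) window shapes: 2-element windows normally, but a single
// 1-element window when the input itself has only one element.
val bigrams = Seq("q1", "c1", "c2").sliding(2).toList
val unigram = Seq("q1").sliding(2).toList
// bigrams == List(List("q1", "c1"), List("c1", "c2"))
// unigram == List(List("q1"))
```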

Now let's compute the RDD of co-occurrences, one per session:

val session_cooc_stats:RDD[Cooccurrences] = sessions_groups.map{ 
  case (sessionId, actions) => {
   var coocs  = Cooccurrences()
   coocs.initFromActions(actions)
   coocs
  }
 }

Unfortunately, it raises the following MatchError:

scala> session_cooc_stats.take(2)

15/02/06 22:50:08 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 4) scala.MatchError: List((Query(query3),1417444969686), (Click(link4),1417444970686)) (of class scala.collection.immutable.$colon$colon) at $line25.$read$$iwC$$iwC$Cooccurrences$$anonfun$initFromActions$2.apply(<console>:29)
  at $line25.$read$$iwC$$iwC$Cooccurrences$$anonfun$initFromActions$2.apply(<console>:29)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at $line25.$read$$iwC$$iwC$Cooccurrences.initFromActions(<console>:29)
  at $line28.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:31)
  at $line28.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:28)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
  at org.apache.spark.rdd.RDD$$anonfun$26.apply(RDD.scala:1081)
  at org.apache.spark.rdd.RDD$$anonfun$26.apply(RDD.scala:1081)
  at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
  at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
  at org.apache.spark.scheduler.Task.run(Task.scala:56)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
15/02/06 22:50:08 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 4, localhost): scala.MatchError: List((Query(query3),1417444969686), (Click(link4),1417444970686)) (of class scala.collection.immutable.$colon$colon)
  at $line25.$read$$iwC$$iwC$Cooccurrences$$anonfun$initFromActions$2.apply(<console>:29)
  at $line25.$read$$iwC$$iwC$Cooccurrences$$anonfun$initFromActions$2.apply(<console>:29)
 ...

If I build my own list of actions, equivalent to the first group of the session_cooc_stats RDD:

val actions:Iterable[(Action, Long)] = Array(
(Query("query1"),t0),
(Click("link1") ,t0+1000),
(Click("link2") ,t0+2000),
(Query("query2"),t0+3000),
(Click("link3") ,t0+4000)
)

I get the expected result:

var c = Cooccurrences()
c.initFromActions(actions)
// c == Cooccurrences(2,2,0,1)
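This local check can be reproduced end to end as a plain-Scala script, with no Spark involved; the class below is the question's code, lightly condensed:

```scala
// Spark-free replica of the question's Cooccurrences logic and local test.
sealed trait Action
case class Query(input: String) extends Action
case class Click(link: String) extends Action

case class Cooccurrences(
  var numQueriesWithClicks: Int = 0,
  var numQueries: Int = 0,
  var numRewritings: Int = 0,
  var numQueriesBeforeClicks: Int = 0
) {
  def initFromActions(actions: Iterable[(Action, Long)]): Unit = {
    val thirtySeconds = 30000L // max gap (ms) for q2 to count as a rewriting of q1
    var hasClicked = false
    // consecutive timestamp-sorted action pairs, as in the question
    actions.toSeq.sortBy(_._2).sliding(2).foreach {
      case Seq((_: Click, _)) => // click-only session
        numQueries = 0
      case Seq((_: Query, _)) => // query-only session
        numQueries = 1
        numQueriesBeforeClicks = 1
      case Seq((_: Click, _), (_: Query, _)) | Seq((_: Click, _), (_: Click, _)) =>
        if (!hasClicked) numQueriesBeforeClicks = numQueries
        hasClicked = true
      case Seq((_: Query, _), (_: Click, _)) => // query followed by a click
        numQueries += 1
        if (!hasClicked) numQueriesBeforeClicks = numQueries
        hasClicked = true
        numQueriesWithClicks += 1
      case Seq((q1: Query, ta), (q2: Query, tb)) => // two consecutive queries
        numQueries += 1
        if (tb - ta < thirtySeconds && q1.input != q2.input) numRewritings += 1
    }
  }
}

val t0 = 1417444964686L
val actions: Iterable[(Action, Long)] = Seq(
  (Query("query1"), t0),
  (Click("link1"), t0 + 1000),
  (Click("link2"), t0 + 2000),
  (Query("query2"), t0 + 3000),
  (Click("link3"), t0 + 4000)
)

val c = Cooccurrences()
c.initFromActions(actions)
// c == Cooccurrences(2,2,0,1)
```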

Something seems to go wrong when I build the Cooccurrences object from the RDD. It seems linked to the CompactBuffer built by groupByKey(). What is missing?

I am new to Spark and Scala. Thank you for your help.

托马斯

2 answers:

Answer 0 (score: 0)

As you suggested, I rewrote the code with IntelliJ and created a companion object holding a main function. Surprisingly, the code compiles (with sbt) and runs perfectly.

However, I don't really understand why the compiled code runs, while it doesn't work in spark-shell.
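For what it's worth, one way to diagnose this kind of failure in spark-shell (a defensive sketch, not the root-cause fix) is to add a wildcard case so an unmatched window is reported instead of crashing with scala.MatchError:

```scala
// A catch-all case turns a would-be scala.MatchError into visible output,
// showing which windows fail to match the typed patterns.
def describe(window: Seq[Int]): String = window match {
  case Seq(a, b) => s"bigram($a,$b)"
  case Seq(a)    => s"unigram($a)"
  case other     => s"unmatched window: $other"
}
```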

Thank you for your answer!

Answer 1 (score: -2)

I set up your code in IntelliJ.

I created classes for Action, Query, Click and Cooccurrences.

Your code, in the main method:

val sessions_groups:RDD[(String, Iterable[(Action, Long)])] = logs.groupByKey().cache()

  val session_cooc_stats:RDD[Cooccurrences] = sessions_groups.map{
    case (sessionId, actions) => {
      val coocs  = Cooccurrences()
      coocs.initFromActions(actions)
      coocs
    }
  }
  session_cooc_stats.take(2).foreach(println(_))

I just changed var coocs to val coocs.

I guess that's the point.

Cooccurrences(0,1,0,1)

Cooccurrences(2,3,1,1)