Question

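I am a beginner trying to fetch tweets with Spark Streaming in Scala, using some filter keywords. After streaming, is it possible to keep only the tweets whose geolocation is not null? I want to save the tweets in ElasticSearch. So, before saving the tweet map to ElasticSearch, can I filter for the tweets that carry geolocation information and then save only those? I am building the JSON with json4s.JsonDSL, using fields from the tweet. Here is the sample code: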

val stream = TwitterUtils.createStream(ssc, None, filters)
val tweetMap = stream.map(status => {
    val tweetMap =
      ("location" -> Option(status.getGeoLocation).map(geo => { s"${geo.getLatitude},${geo.getLongitude}" })) ~
      ("UserLang" -> status.getUser.getLang) ~
      ("UserLocation" -> Option(status.getUser.getLocation)) ~
      ("UserName" -> status.getUser.getName) ~
      ("Text" -> status.getText) ~
      ("TextLength" -> status.getText.length) ~
      //Tokenized the tweet message and then filtered only words starting with #
      ("HashTags" -> status.getText.split(" ").filter(_.startsWith("#")).mkString(" ")) ~
      ("PlaceCountry" -> Option(status.getPlace).map (pl => {s"${pl.getCountry}"}))
    tweetMap
})

tweetMap.map(s => List("Tweet Extracted")).print

// Each batch is saved to Elasticsearch 
tweetMap.foreachRDD { tweets => EsSpark.saveToEs(tweets, "sparksender/tweets") }

// Is there a way, before this step, to filter out the tweets whose "location" is null?

I based this on code from GitHub: https://github.com/luvgupta008/ScreamingTwitter/blob/master/src/main/scala/com/spark/streaming/TwitterTransmitter.scala

Answer 1

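Take a look at the filter method, which is available on RDDs and on DStreams alike. It takes a predicate function (a: A) => Boolean. If the predicate returns true for an element, that element is kept in the result. If it returns false, the element is dropped.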

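// Filter the raw Status stream before mapping it to JSON: the elements of
// tweetMap are JSON objects and no longer expose getGeoLocation.
val geoTagged = stream.filter(
  status => Option(status.getGeoLocation) match {
    case Some(_) => true  // geolocation present: keep the tweet
    case None => false    // getGeoLocation returned null: drop the tweet
  })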
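
To make the predicate idea concrete without needing a Spark or Twitter setup, here is a minimal sketch on plain Scala collections. Tweet and Geo are hypothetical stand-ins for twitter4j's Status and GeoLocation, and GeoFilterSketch, hasGeo and locations are names made up for this illustration. It only shows the filter-then-map ordering and is not the code from the linked repository.

```scala
// Hypothetical stand-ins for twitter4j's GeoLocation and Status; only the
// fields needed for the illustration are modelled.
final case class Geo(latitude: Double, longitude: Double)
final case class Tweet(text: String, geo: Geo) // geo may be null, like getGeoLocation

object GeoFilterSketch {
  // Predicate: true when the tweet carries a geolocation.
  // Option(null) evaluates to None, so no explicit null check is needed.
  def hasGeo(tweet: Tweet): Boolean = Option(tweet.geo).isDefined

  // Filter first, then map: every element that reaches map has a location.
  def locations(tweets: Seq[Tweet]): Seq[String] =
    tweets
      .filter(hasGeo)
      .map(t => s"${t.geo.latitude},${t.geo.longitude}")

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      Tweet("with location", Geo(48.85, 2.35)),
      Tweet("without location", null)
    )
    locations(sample).foreach(println)
  }
}
```

On a DStream the same shape should apply: call filter with the predicate on the stream of Status objects, then build the JSON from the filtered stream, so that only geotagged tweets reach EsSpark.saveToEs.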
