Question

I created a Twitter data stream that displays the hashtags, the author, and the mentioned users in the following format.

(List(timetofly, hellocake),Shera_Eyra,List(blxcknicotine, kimtheskimm))

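Because of the embedded lists, I cannot run any analysis on this format. How can I create another data stream that displays the data in this format?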

timetofly, Shera_Eyra, blxcknicotine
timetofly, Shera_Eyra, kimtheskimm
hellocake, Shera_Eyra, blxcknicotine
hellocake, Shera_Eyra, kimtheskimm

Here is the code I use to generate the data:

 val sparkConf = new SparkConf().setAppName("TwitterPopularTags")
 val ssc = new StreamingContext(sparkConf, Seconds(sampleInterval))
 val stream = TwitterUtils.createStream(ssc, None)
 val data = stream.map { line =>
   (line.getHashtagEntities.map(_.getText),
    line.getUser().getScreenName(),
    line.getUserMentionEntities.map(_.getScreenName).toList)
 }

Answer 1

In your snippet, data is a DStream[(Array[String], String, List[String])]. To get a DStream[String] in the desired format, you can use flatMap and map:

val data = stream.map { line =>
  (line.getHashtagEntities.map(_.getText),
   line.getUser().getScreenName(),
   line.getUserMentionEntities.map(_.getScreenName).toList)
}

val data2 = data.flatMap(a => a._1.flatMap(b => a._3.map(c => (b, a._2, c))))
                .map { case (hash, user, mention) => s"$hash, $user, $mention" }

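The flatMap yields a DStream[(String, String, String)], where each tuple consists of a hashtag, the user, and a mentioned user. The subsequent map with pattern matching creates a DStream[String], where each String consists of the tuple's elements separated by a comma and a space.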
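
To try the same flatMap/map logic without a running stream, here is a minimal sketch on plain Scala collections. The object name FlattenSketch and the tweets value are hypothetical stand-ins for the DStream elements, filled with the sample from the question; the toSeq call is only there to keep this collection-only version simple and is not part of the answer's code.

```scala
object FlattenSketch {
  // Hypothetical stand-in for the elements of `data`; values taken from the question's sample.
  val tweets: Seq[(Array[String], String, List[String])] = Seq(
    (Array("timetofly", "hellocake"), "Shera_Eyra", List("blxcknicotine", "kimtheskimm"))
  )

  // Pair every hashtag with every mention, repeating the author in each triple.
  val triples: Seq[(String, String, String)] =
    tweets.flatMap(a => a._1.toSeq.flatMap(b => a._3.map(c => (b, a._2, c))))

  // Render each triple as "hashtag, user, mention".
  val lines: Seq[String] =
    triples.map { case (hash, user, mention) => s"$hash, $user, $mention" }

  def main(args: Array[String]): Unit = lines.foreach(println)
}
```

Note that this is a cross product of hashtags and mentions, so a tweet with no hashtags or no mentions contributes no rows at all.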

Answer 2

I would use a for-comprehension:

  val data = (List("timetofly", "hellocake"), "Shera_Eyra", List("blxcknicotine", "kimtheskimm"))

  val result = for {
    hashtag <- data._1
    user = data._2
    mentionedUser <- data._3
  } yield (hashtag, user, mentionedUser)

  result.foreach(println)

Output:

(timetofly,Shera_Eyra,blxcknicotine)
(timetofly,Shera_Eyra,kimtheskimm)
(hellocake,Shera_Eyra,blxcknicotine)
(hellocake,Shera_Eyra,kimtheskimm)

If you prefer a Seq of lists of strings rather than a Seq of tuples of strings, change the yield to produce a list instead.
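
The snippet for this change is not shown in the answer text, so the following is only a sketch of what it might look like, assuming the tuple in the yield is simply swapped for a List. The object name YieldListSketch and the formatted value are hypothetical additions for illustration.

```scala
object YieldListSketch {
  val data = (List("timetofly", "hellocake"), "Shera_Eyra", List("blxcknicotine", "kimtheskimm"))

  // Assumed variant: yield a List instead of a tuple, giving a List[List[String]].
  val result: List[List[String]] = for {
    hashtag <- data._1
    user = data._2
    mentionedUser <- data._3
  } yield List(hashtag, user, mentionedUser)

  // Optional: join each inner list to get the comma-separated lines from the question.
  val formatted: List[String] = result.map(_.mkString(", "))

  def main(args: Array[String]): Unit = formatted.foreach(println)
}
```

One possible reason to prefer lists here is that mkString gives the comma-separated format from the question directly, without pattern matching on a tuple.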

Scala: flattening an embedded list of lists