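Scala pattern matching trouble with map - required: String

Date: 2017-03-21 12:32:14

Tags: scala apache-spark

I am trying to turn my RDD into a pair RDD, but I am having trouble with the pattern matching and I cannot see what I am doing wrong.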

val test = sc.textFile("neighborhood_test.csv");
val nhead0 = test.first;

val test_split = test.map(line => line.split("\t"));
val nhead = test_split.first;

val test_neigh0 = test.filter(line => line!= nhead0);
//test_neigh0.first = 3335  Dunlap  Seattle
val test_neigh1 = test_neigh0.map(line => line.split("\t")); 
//test_neigh1.first = Array[String] = Array(3335, Dunlap, Seattle)
val test_neigh = test_neigh1.map({case (id, neigh, city) => (id, (neigh, city))});

This gives the error:

found   : (T1, T2, T3)
required: String
val test_neigh = test_neigh0.map({case (id, neigh, city) => (id, (neigh, city))});

Edit: The input file is tab-separated and looks like this:

id  neighbourhood   city
3335    Dunlap  Seattle
4291    Roosevelt   Seattle
5682    South Delridge  Seattle

As output, I want a pair RDD with id as the key and (neigh, city) as the value.

2 answers:

Answer 0 (score: 3)

Neither test_neigh0.first nor test_neigh1.first is a triple, so you cannot pattern match on it as one.

The elements of test_neigh1 are Array[String]. Assuming these arrays all have length 3, you can pattern match on them with { case Array(id, neigh, city) => ...}.
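As a minimal sketch of that Array pattern, the following uses a local Seq in place of the RDD so it runs without Spark. The object name ArrayPatternSketch and the helper toPair are hypothetical names chosen for illustration, and the sample lines are taken from the input file shown in the question.

```scala
object ArrayPatternSketch {
  // Hypothetical helper: parses one tab-separated line into (id, (neigh, city)).
  // The Array(...) pattern matches only arrays with exactly three elements;
  // any other length raises scala.MatchError.
  def toPair(line: String): (String, (String, String)) =
    line.split("\t") match {
      case Array(id, neigh, city) => (id, (neigh, city))
    }

  def main(args: Array[String]): Unit = {
    // Assumption: a local Seq stands in for the RDD of lines.
    val lines = Seq("3335\tDunlap\tSeattle", "4291\tRoosevelt\tSeattle")
    lines.map(toPair).foreach(println)
  }
}
```

A tuple pattern such as case (id, neigh, city) only type-checks against a Tuple3, which is why the compiler reports found (T1, T2, T3) when the elements are String or Array[String].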

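To make sure you do not get a MatchError when a line has more or fewer than 3 elements, you can collect with this pattern match instead of mapping over it.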

import org.apache.spark.rdd.RDD

val test_neigh: RDD[(String, (String, String))] = test_neigh1.collect {
  case Array(id, neigh, city) => (id, (neigh, city))
}
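The difference between collecting and mapping can be sketched on a plain Scala collection, whose collect takes a partial function in the same way. This is an illustration under assumptions rather than the answerer's code: the object name CollectVsMapSketch, the value toPair and the malformed sample row are all made up for the example.

```scala
object CollectVsMapSketch {
  // Partial function: defined only for arrays with exactly three fields.
  val toPair: PartialFunction[Array[String], (String, (String, String))] = {
    case Array(id, neigh, city) => (id, (neigh, city))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local Seq stands in for the RDD; the second row is
    // deliberately malformed (no tabs, so split yields a single field).
    val rows = Seq("3335\tDunlap\tSeattle", "malformed line", "4291\tRoosevelt\tSeattle")
      .map(_.split("\t"))

    // collect keeps only the rows for which the partial function is defined.
    val pairs = rows.collect(toPair)
    pairs.foreach(println)

    // By contrast, rows.map(toPair) would throw scala.MatchError
    // when it reaches the malformed row.
  }
}
```

Note that on an RDD, collect with a partial function argument returns another RDD, whereas collect() with no arguments brings all elements back to the driver as an Array.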

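Edit

The problems you describe in the comments are related to RDD[_] not being a regular collection (such as List, Array or Set). To avoid them, you may need to get the elements out of the array without pattern matching: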

val test_neigh: RDD[(String, (String, String))] = test_neigh0.map(line => {
  val arr = line.split("\t")
  (arr(0), (arr(1), arr(2)))
})
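Indexing into the array without a check throws ArrayIndexOutOfBoundsException on a line with fewer than three fields. One possible guard, sketched here on a local Seq so it runs without Spark, is to return an Option and flatten it. The object name IndexAccessSketch and the helper parseLine are hypothetical, and the short sample line is invented for the example.

```scala
object IndexAccessSketch {
  // Hypothetical helper: returns Some((id, (neigh, city))) for a line with
  // exactly three tab-separated fields, and None for anything else.
  def parseLine(line: String): Option[(String, (String, String))] = {
    val arr = line.split("\t")
    if (arr.length == 3) Some((arr(0), (arr(1), arr(2)))) else None
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local Seq stands in for the RDD of lines.
    val lines = Seq("3335\tDunlap\tSeattle", "too\tshort", "5682\tSouth Delridge\tSeattle")
    // flatMap drops the None results and unwraps the Some results.
    val pairs = lines.flatMap(line => parseLine(line))
    pairs.foreach(println)
  }
}
```

The same function should be usable with flatMap on an RDD of lines, although that is not shown or tested here.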

Answer 1 (score: 2)

val baseRDD = sc.textFile("neighborhood_test.csv").filter { x => !x.contains("city") }
baseRDD.map { x =>
      val split = x.split("\t")
      (split(0), (split(1), split(2)))
    }.groupByKey().foreach(println(_))

Result:

(3335,CompactBuffer((Dunlap,Seattle)))

(4291,CompactBuffer((Roosevelt,Seattle)))

(5682,CompactBuffer((South Delridge,Seattle)))