Question

I'm trying to turn my RDD into a pair RDD, but I'm having trouble with the pattern matching and I don't know what I'm doing wrong.

val test = sc.textFile("neighborhood_test.csv");
val nhead0 = test.first;

val test_split = test.map(line => line.split("\t"));
val nhead = test_split.first;

val test_neigh0 = test.filter(line => line!= nhead0);
//test_neigh0.first = 3335  Dunlap  Seattle
val test_neigh1 = test_neigh0.map(line => line.split("\t")); 
//test_neigh1.first = Array[String] = Array(3335, Dunlap, Seattle)
val test_neigh = test_neigh1.map({case (id, neigh, city) => (id, (neigh, city))});

This gives the error:

found   : (T1, T2, T3)
required: String
val test_neigh = test_neigh0.map({case (id, neigh, city) => (id, (neigh, city))});

Edit: the input file is tab-separated and looks like this:

id  neighbourhood   city
3335    Dunlap  Seattle
4291    Roosevelt   Seattle
5682    South Delridge  Seattle

As output, I want a pair RDD with id as the key and (neigh, city) as the value.

Answer 1

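Neither test_neigh0.first nor test_neigh1.first is a triple, so you cannot pattern match on them as if they were. test_neigh0 holds plain String lines, which is why the compiler reports "required: String" when it sees a tuple pattern.

The elements of test_neigh1 are Array[String]. Assuming these arrays all have length 3, you can pattern match on them with { case Array(id, neigh, city) => ...}.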
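As a minimal sketch of that pattern, the snippet below uses a plain Scala Seq in place of the RDD so that it runs without a SparkContext. The sample rows are copied from the question; the names `lines` and `pairs` are made up for illustration. Passing the same `case Array(...)` function to `map` on an RDD should behave the same way.

```scala
// Plain Scala collection standing in for the RDD (assumption: no Spark needed here).
val lines = Seq("3335\tDunlap\tSeattle", "4291\tRoosevelt\tSeattle")

// split returns Array[String], so the pattern must use the Array extractor,
// not a tuple pattern such as (id, neigh, city).
val pairs: Seq[(String, (String, String))] =
  lines
    .map(line => line.split("\t"))
    .map { case Array(id, neigh, city) => (id, (neigh, city)) }

pairs.foreach(println)
// prints:
// (3335,(Dunlap,Seattle))
// (4291,(Roosevelt,Seattle))
```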

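To make sure you don't get a MatchError if one of the rows has more or fewer than 3 elements, you can collect with this pattern match instead of mapping over it. Note that this is the collect overload that takes a partial function and returns a new RDD, not the no-argument collect that brings the data back to the driver.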

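import org.apache.spark.rdd.RDD  // needed for the explicit type annotation

// Rows that do not split into exactly 3 fields are skipped.
val test_neigh: RDD[(String, (String, String))] = test_neigh1.collect{
  case Array(id, neigh, city) => (id, (neigh, city))
}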
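The difference between the two calls can be sketched on an ordinary Scala Seq, which offers the same `map` and partial-function `collect` operations. The malformed row `"9999\tBroken"` and the names `rows`, `kept` and `mapped` are hypothetical, added only to show the behaviour.

```scala
import scala.util.Try

// Hypothetical input: the second row has only two fields.
val rows: Seq[Array[String]] =
  Seq("3335\tDunlap\tSeattle", "9999\tBroken", "4291\tRoosevelt\tSeattle")
    .map(line => line.split("\t"))

// collect applies the partial function only where it is defined,
// so the malformed row is silently dropped.
val kept = rows.collect { case Array(id, neigh, city) => (id, (neigh, city)) }

// map applies the function to every element and throws scala.MatchError
// on the malformed row; Try captures that exception here.
val mapped = Try(rows.map { case Array(id, neigh, city) => (id, (neigh, city)) })

println(kept.size)         // 2
println(mapped.isFailure)  // true
```

On an RDD the MatchError would only surface once an action runs, because transformations such as map are evaluated lazily.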

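Edit

The problems you describe in the comments have to do with RDD[_] not being a regular collection (such as List, Array or Set). To avoid them, you may need to access the elements of the array by index, without pattern matching: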

val test_neigh: RDD[(String, (String, String))] = test_neigh0.map { line =>
  val arr = line.split("\t")        // Array(id, neigh, city)
  (arr(0), (arr(1), arr(2)))        // throws if the line has fewer than 3 fields
}
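Index access throws an ArrayIndexOutOfBoundsException on a short row, so one possible variation is to check the length first and return an Option. This is only a sketch on a plain Scala Seq; `parseLine`, `sample` and `parsed` are hypothetical names, and the assumption is that the same function could be passed to flatMap on the RDD of lines.

```scala
// Hypothetical helper: returns None for rows that do not have exactly 3 fields.
def parseLine(line: String): Option[(String, (String, String))] = {
  val arr = line.split("\t")
  if (arr.length == 3) Some((arr(0), (arr(1), arr(2)))) else None
}

// Plain Seq standing in for the RDD of lines; the second entry is malformed.
val sample = Seq("5682\tSouth Delridge\tSeattle", "bad line")

// flatMap keeps the Some values and drops the None values.
val parsed = sample.flatMap(line => parseLine(line))

parsed.foreach(println)
// prints:
// (5682,(South Delridge,Seattle))
```

Keep in mind that String.split discards trailing empty fields, so a row with an empty last column would also be rejected by the length check.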

Answer 2

val baseRDD = sc.textFile("neighborhood_test.csv").filter { x => !x.contains("city") }
baseRDD.map { x =>
      val split = x.split("\t")
      (split(0), (split(1), split(2)))
    }.groupByKey().foreach(println(_))

Result:

(3335,CompactBuffer((Dunlap,Seattle)))
(4291,CompactBuffer((Roosevelt,Seattle)))
(5682,CompactBuffer((South Delridge,Seattle)))

Scala pattern matching trouble with map - required String
