I am trying to turn my RDD into a pair RDD, but I am having trouble with the pattern matching and I don't know what I am doing wrong.
val test = sc.textFile("neighborhood_test.csv");
val nhead0 = test.first;
val test_split = test.map(line => line.split("\t"));
val nhead = test_split.first;
val test_neigh0 = test.filter(line => line!= nhead0);
//test_neigh0.first = 3335 Dunlap Seattle
val test_neigh1 = test_neigh0.map(line => line.split("\t"));
//test_neigh1.first = Array[String] = Array(3335, Dunlap, Seattle)
val test_neigh = test_neigh1.map({case (id, neigh, city) => (id, (neigh, city))});
This gives the error:
found : (T1, T2, T3)
required: String
val test_neigh = test_neigh0.map({case (id, neigh, city) => (id, (neigh, city))});
Edit: The input file is tab-separated and looks like this:
id neighbourhood city
3335 Dunlap Seattle
4291 Roosevelt Seattle
5682 South Delridge Seattle
As output, I want a pair RDD with id as the key and (neigh, city) as the value.
Answer 0 (score: 3)
Neither test_neigh0.first (a String) nor test_neigh1.first (an Array[String]) is a triple, so you cannot pattern match on them as a 3-tuple.
The elements of test_neigh1 are Array[String]. Assuming these arrays all have length 3, you can pattern match on them with { case Array(id, neigh, city) => ... }.
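As a minimal sketch of the Array extractor in isolation, the snippet below destructures a single tab-separated line in plain Scala, with no Spark involved. The object name ArrayMatchSketch and the helper toPair are hypothetical names chosen for illustration.

```scala
object ArrayMatchSketch {
  // Split one tab-separated line and destructure the resulting Array[String].
  // The match is only defined for arrays of exactly three elements;
  // any other length throws scala.MatchError.
  def toPair(line: String): (String, (String, String)) =
    line.split("\t") match {
      case Array(id, neigh, city) => (id, (neigh, city))
    }
}
```

The same case clause can be passed to map or collect on test_neigh1, because each element there is already an Array[String].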
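To make sure you do not get a MatchError when a line has more or fewer than 3 elements, you can collect with this pattern match instead of mapping with it.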
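import org.apache.spark.rdd.RDD // needed for the type annotation below
// collect with a partial function keeps only the rows the pattern matches
val test_neigh: RDD[(String, (String, String))] = test_neigh1.collect{
case Array(id, neigh, city) => (id, (neigh, city))
}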
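The difference between map and collect with a partial function can be sketched on an ordinary Scala Seq, assuming the same behaviour carries over to RDDs (where the error would only surface once an action runs, because transformations are lazy). The sample rows and the object name CollectVsMapSketch are made up for illustration; the second row deliberately has only two fields.

```scala
import scala.util.Try

object CollectVsMapSketch {
  // Hypothetical sample rows; the second one is malformed (two fields).
  val rows: Seq[Array[String]] = Seq(
    "3335\tDunlap\tSeattle",
    "4291\tRoosevelt"
  ).map(_.split("\t"))

  // collect applies the partial function only where it is defined,
  // so the malformed row is silently skipped.
  val collected: Seq[(String, (String, String))] = rows.collect {
    case Array(id, neigh, city) => (id, (neigh, city))
  }

  // map applies the function to every row, so the malformed row
  // throws scala.MatchError; Try captures it as a Failure.
  val mapped: Try[Seq[(String, (String, String))]] =
    Try(rows.map { case Array(id, neigh, city) => (id, (neigh, city)) })
}
```

Note that collect drops malformed rows without any warning, so it may be worth counting the input and output if dropped rows matter.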
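Edit:
The problem you describe in the comments is related to RDD[_] not being a regular collection (such as List, Array or Set). To avoid it, you may need to take the elements out of the array without pattern matching: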
val test_neigh: RDD[(String, (String, String))] = test_neigh0.map(line => {
val arr = line.split("\t")
(arr(0), (arr(1), arr(2)))
})
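Indexing with arr(1) and arr(2) assumes every line has at least three fields; a shorter line would throw ArrayIndexOutOfBoundsException. A possible guard, sketched in plain Scala, is shown below. The names SafeSplitSketch and parse are hypothetical, and the choice to reject lines that do not have exactly three fields is an assumption rather than something the question requires.

```scala
object SafeSplitSketch {
  // Returns Some(pair) for lines with exactly three tab-separated fields,
  // and None otherwise, instead of throwing on malformed input.
  def parse(line: String): Option[(String, (String, String))] = {
    // A limit of -1 keeps trailing empty fields, which split drops by default.
    val arr = line.split("\t", -1)
    if (arr.length == 3) Some((arr(0), (arr(1), arr(2)))) else None
  }
}
```

On an RDD, a function like this could be passed to flatMap so that malformed lines simply produce no output element.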
Answer 1 (score: 2)
val baseRDD = sc.textFile("neighborhood_test.csv").filter { x => !x.contains("city") }
baseRDD.map { x =>
val split = x.split("\t")
(split(0), (split(1), split(2)))
}.groupByKey().foreach(println(_))
Result:
(3335,CompactBuffer((Dunlap,Seattle)))
(4291,CompactBuffer((Roosevelt,Seattle)))
(5682,CompactBuffer((South Delridge,Seattle)))