There are two very large RDDs (each with more than a million records). The first is:
rdd1.txt(name,value):
chr1 10016
chr1 10017
chr1 10018
chr1 20026
chr1 20036
chr1 25016
chr1 26026
chr2 40016
chr2 40116
chr2 50016
chr3 70016
rdd2.txt(name,min,max):
chr1 10000 20000
chr1 20000 30000
chr2 40000 50000
chr2 50000 60000
chr3 70000 80000
chr3 810001 910000
chr3 860001 960000
chr3 910001 1010000
A value counts only if it falls within a [min, max] range of the second RDD for the same name; each valid value increments the count for that name by 1.
With the example above, chr1 should end up with a count of 7.
How can I get this result with Spark in Scala?
Many thanks in advance.

Answer 0 (score: 2):
Try:
val rdd1 = sc.parallelize(Seq(
  ("chr1", 10016), ("chr1", 10017), ("chr1", 10018)))
val rdd2 = sc.parallelize(Seq(
  ("chr1", 10000, 20000), ("chr1", 20000, 30000)))

rdd1.toDF("name", "value")
  .join(rdd2.toDF("name", "min", "max"), Seq("name"))
  .where($"value".between($"min", $"max"))
Answer 1 (score: 0):
As I understand it, you want the values in rdd1 that lie between the min and max in rdd2. Please check whether the following works:
val rdd1 = sc.parallelize(Seq(("chr1", 10016), ("chr1", 10017), ("chr1", 10018)))
val rdd2 = sc.parallelize(Seq(("chr1", 10000, 20000), ("chr1", 20000, 30000)))

rdd1.toDF("name", "value")
  .join(rdd2.toDF("name", "min", "max"), Seq("name"))
  .where($"value".between($"min", $"max"))
  .groupBy($"name")
  .count()
  .show()
scala> val rdd1 = sc.parallelize(Seq(("chr1", 10016), ("chr1", 10017), ("chr1", 10018), ("chr1", 20026), ("chr1", 20036), ("chr1", 25016), ("chr1", 26026), ("chr2", 40016), ("chr2", 40116), ("chr2", 50016), ("chr3", 70016)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[33] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(Seq(("chr1", 10000, 20000), ("chr1", 20000, 30000), ("chr2", 40000, 50000), ("chr2", 50000, 60000), ("chr3", 70000, 80000), ("chr3", 810001, 910000), ("chr3", 860001, 960000), ("chr3", 910001, 1010000)))
rdd2: org.apache.spark.rdd.RDD[(String, Int, Int)] = ParallelCollectionRDD[34] at parallelize at <console>:24
scala> rdd1.toDF("name", "value").join(rdd2.toDF("name", "min", "max"), Seq("name")).where($"value".between($"min", $"max")).groupBy($"name").count().show()
+----+-----+
|name|count|
+----+-----+
|chr3| 1|
|chr1| 7|
|chr2| 3|
+----+-----+
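For reference, the same count can also be computed with plain RDD operations, without converting to DataFrames. This is only a sketch: it assumes each name has a modest number of ranges (they are collected per key with groupByKey), and it counts a value once even if it falls into several overlapping ranges, whereas the DataFrame join above counts it once per matching range.

// Hedged sketch: RDD-only version of the same range count
val rangesByName = rdd2
  .map { case (name, min, max) => (name, (min, max)) }
  .groupByKey()                                   // assumes few ranges per name

val countsByName = rdd1
  .join(rangesByName)                             // (name, (value, ranges))
  .filter { case (_, (value, ranges)) =>
    ranges.exists { case (min, max) => value >= min && value <= max } }
  .map { case (name, _) => (name, 1) }
  .reduceByKey(_ + _)

countsByName.collect().foreach(println)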
Edit: If you are reading from files, I would use the following:
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val sqlContext = new SQLContext(sc)

val nameValueSchema = StructType(Array(
  StructField("name", StringType, true),
  StructField("value", IntegerType, true)))
val nameMinMaxSchema = StructType(Array(
  StructField("name", StringType, true),
  StructField("min", IntegerType, true),
  StructField("max", IntegerType, true)))

val rdd1 = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "false").schema(nameValueSchema).load("rdd1.csv")
val rdd2 = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "false").schema(nameMinMaxSchema).load("rdd2.csv")

rdd1.join(rdd2, Seq("name"))
  .where($"value".between($"min", $"max"))
  .groupBy($"name")
  .count()
  .show()
This runs across all nodes, and there is no need to call parallelize. Quoting the documentation here:
def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T]
Distribute a local Scala collection to form an RDD.
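If a newer Spark release (2.0 or later) is available, the external spark-csv package is no longer needed; the built-in CSV reader on SparkSession can be used instead. A minimal sketch, under the assumption that the sample files rdd1.csv and rdd2.csv are tab-separated:

// Hedged sketch: Spark 2.x+ equivalent using the built-in CSV source
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val spark = SparkSession.builder().appName("range-count").getOrCreate()
import spark.implicits._

val nameValueSchema = StructType(Array(
  StructField("name", StringType, true),
  StructField("value", IntegerType, true)))
val nameMinMaxSchema = StructType(Array(
  StructField("name", StringType, true),
  StructField("min", IntegerType, true),
  StructField("max", IntegerType, true)))

// option("sep", "\t") assumes tab-delimited input; adjust to match the actual files
val values = spark.read.option("sep", "\t").schema(nameValueSchema).csv("rdd1.csv")
val ranges = spark.read.option("sep", "\t").schema(nameMinMaxSchema).csv("rdd2.csv")

values.join(ranges, Seq("name"))
  .where($"value".between($"min", $"max"))
  .groupBy($"name")
  .count()
  .show()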