I have a couple of lists of this kind (List[Array[String]]):
1)List(Array("Mark","2000","2002"), Array("John","2001","2003"), Array("Andrew","1999","2001"), Array("Erik","1996","1998"))
2)List(Array("Steve","2000","2005"))
Based on this condition:
if the year ranges overlap, it means the two people know each other; otherwise they don't.
I would like the data grouped in this way:
Array(name, start_year, end_year, known_people, unknown_people)
So, for the concrete example 1), the final result would be:
List(
Array("Mark", "2000", "2002", "John#Andrew", "Erik"),
Array("John", "2001", "2003", "Mark#Andrew", "Erik"),
Array("Andrew", "1999", "2001", "Mark#John", "Erik"),
Array("Erik", "1996", "1998", "", "Mark#John#Andrew")
)
While for the second case only:
List(Array("Steve","2000","2005", "", ""))
I don't know how to proceed, because I'm stuck after doing the Cartesian product and filtering out the pairs with the same name, like this:
my_list.cartesian(my_list).filter { case (a,b) => a(0) != b(0) }
But at this point I can't get aggregateByKey to work.
Any ideas?
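For reference, here is a minimal sketch of how the cartesian-plus-aggregateByKey idea above could be completed, using a hypothetical helper knownAndUnknown and assuming my_list is an RDD[Array[String]] whose rows have the form Array(name, fromYear, toYear); the answer below takes a DataFrame route instead:

import org.apache.spark.rdd.RDD

// my_list: RDD[Array[String]] with rows like Array(name, fromYear, toYear)
def knownAndUnknown(my_list: RDD[Array[String]]): RDD[Array[String]] =
  my_list.cartesian(my_list)
    .map { case (a, b) =>
      val same     = a(0) == b(0)
      // two year ranges overlap iff each one starts no later than the other ends
      val overlaps = !same && a(1).toInt <= b(2).toInt && b(1).toInt <= a(2).toInt
      ((a(0), a(1), a(2)), (b(0), same, overlaps))
    }
    .aggregateByKey((List.empty[String], List.empty[String]))(
      // seqOp: sort the other person into the known or the unknown list;
      // the self pair is skipped but keeps single-person inputs in the result
      { case ((known, unknown), (name, same, overlaps)) =>
          if (same) (known, unknown)
          else if (overlaps) (name :: known, unknown)
          else (known, name :: unknown)
      },
      // combOp: merge partial lists coming from different partitions
      { case ((k1, u1), (k2, u2)) => (k1 ++ k2, u1 ++ u2) }
    )
    .map { case ((name, from, to), (known, unknown)) =>
      Array(name, from, to, known.mkString("#"), unknown.mkString("#"))
    }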
Answer (score: 3)

The following code
case class Person(name: String, fromYear: Int, toYear: Int)

class UnsortedTestSuite3 extends SparkFunSuite {

  configuredUnitTest("SO - aggregateByKey") { sc =>

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.{UserDefinedFunction, Column, SQLContext, DataFrame}

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val persons = Seq(
      Person("Mark", 2000, 2002),
      Person("John", 2001, 2003),
      Person("Andrew", 1999, 2001),
      Person("Erik", 1996, 1998)
    )

    // input
    val personDF = sc.parallelize( persons ).toDF
    val personRenamedDF = personDF.select(
      col("name").as("right_name"),
      col("fromYear").as("right_fromYear"),
      col("toYear").as("right_toYear")
    )

    /**
     * Group the entries of the second column of a DataFrame by the entries in its first column.
     * @param df a dataframe with two string columns
     * @return dataframe, where the second column contains the group of values for an identical entry in the first column
     */
    def groupBySecond( df: DataFrame ) : DataFrame = {
      val st: StructType = df.schema
      if ( (st.size != 2) ||
           (! st(0).dataType.equals(StringType) ) ||
           (! st(1).dataType.equals(StringType) ) ) throw new RuntimeException("Wrong schema for groupBySecond.")
      df.rdd
        .map( row => (row.getString(0), row.getString(1)) )
        .groupByKey().map( x => ( x._1, x._2.toList))
        .toDF( st(0).name, st(1).name )
    }

    val joined = personDF.join(personRenamedDF, col("name") !== col("right_name"), "inner")
    val intervalOverlaps = (col("toYear") >= col("right_fromYear")) && (col("fromYear") <= col("right_toYear"))

    val known   = groupBySecond( joined.filter( intervalOverlaps ).select(col("name"), col("right_name").as("knows")) )
    val unknown = groupBySecond( joined.filter( !intervalOverlaps ).select(col("name"), col("right_name").as("does_not_know")) )

    personDF.join( known, "name").join(unknown, "name").show()
  }
}
provides you with the expected result:
+------+--------+------+--------------+-------------+
| name|fromYear|toYear| knows|does_not_know|
+------+--------+------+--------------+-------------+
| John| 2001| 2003|[Mark, Andrew]| [Erik]|
| Mark| 2000| 2002|[John, Andrew]| [Erik]|
|Andrew| 1999| 2001| [Mark, John]| [Erik]|
+------+--------+------+--------------+-------------+
Some remarks on the code:

- A case class Person is used, so there is no fiddling with Array indices.
- The helper groupBySecond performs a groupBy on a DataFrame and collects the values of the second column into a list per key. Currently this is not possible in Spark SQL itself, since there is no suitable UDAF (user-defined aggregate function) yet; a follow-up SO question will be raised to hear the experts on this (see also the sketch right after these remarks).
- The known and unknown DataFrames are joined back onto the original personDF DataFrame to obtain the final result.
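As an aside, on newer Spark versions (2.x) the per-key grouping that groupBySecond does by hand with an RDD round-trip can be expressed with the built-in collect_list aggregate; a minimal sketch, assuming Spark 2.x and the imports from the snippet above:

def groupBySecond(df: DataFrame): DataFrame = {
  val Array(keyCol, valCol) = df.columns   // expects exactly two columns
  // collect_list skips null values, so rows without a match end up with an empty list
  df.groupBy(keyCol).agg(collect_list(valCol).as(valCol))
}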
I just noticed, however, that the code above does not yet provide the correct result (Erik is missing!). Hence the corrected version:
case class Person(name: String, fromYear: Int, toYear: Int)

class UnsortedTestSuite3 extends SparkFunSuite {

  configuredUnitTest("SO - aggregateByKey") { sc =>

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.{UserDefinedFunction, Column, SQLContext, DataFrame}

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val persons = Seq(
      Person("Mark", 2000, 2002),
      Person("John", 2001, 2003),
      Person("Andrew", 1999, 2001),
      Person("Erik", 1996, 1998)
    )

    // input
    val personDF = sc.parallelize( persons ).toDF
    val personRenamedDF = personDF.select(
      col("name").as("right_name"),
      col("fromYear").as("right_fromYear"),
      col("toYear").as("right_toYear")
    )

    /**
     * Group the entries of the second column of a DataFrame by the entries in its first column.
     * @param df a dataframe with two string columns
     * @return dataframe, where the second column contains the group of values for an identical entry in the first column
     */
    def groupBySecond( df: DataFrame ) : DataFrame = {
      val st: StructType = df.schema
      if ( (st.size != 2) ||
           (! st(0).dataType.equals(StringType) ) ||
           (! st(1).dataType.equals(StringType) ) ) throw new RuntimeException("Wrong schema for groupBySecond.")
      df.rdd
        .map( row => (row.getString(0), row.getString(1)) )
        // a group consisting of a single null comes from an unmatched left-outer row: map it to an empty list
        .groupByKey().map( x => ( x._1, if (x._2 == List(null)) List() else x._2.toList ))
        .toDF( st(0).name, st(1).name )
    }

    val distinctName     = col("name") !== col("right_name")
    val intervalOverlaps = (col("toYear") >= col("right_fromYear")) && (col("fromYear") <= col("right_toYear"))

    val knownDF_t = personDF.join(personRenamedDF, distinctName && intervalOverlaps, "leftouter")
    val knownDF   = groupBySecond( knownDF_t.select(col("name").as("kname"), col("right_name").as("knows")) )

    val unknownDF_t = personDF.join(personRenamedDF, distinctName && !intervalOverlaps, "leftouter")
    val unknownDF   = groupBySecond( unknownDF_t.select(col("name").as("uname"), col("right_name").as("does_not_know")) )

    personDF
      .join( knownDF, personDF("name") === knownDF("kname"), "leftouter")
      .join( unknownDF, personDF("name") === unknownDF("uname"), "leftouter")
      .select( col("name"), col("fromYear"), col("toYear"), col("knows"), col("does_not_know"))
      .show()
  }
}
Result:
+------+--------+------+--------------+--------------------+
| name|fromYear|toYear| knows| does_not_know|
+------+--------+------+--------------+--------------------+
| John| 2001| 2003|[Mark, Andrew]| [Erik]|
| Mark| 2000| 2002|[John, Andrew]| [Erik]|
|Andrew| 1999| 2001| [Mark, John]| [Erik]|
| Erik| 1996| 1998| []|[Mark, John, Andrew]|
+------+--------+------+--------------+--------------------+
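If the result is needed in the exact List(Array(name, fromYear, toYear, known, unknown)) shape from the question, with the names joined by "#", the final DataFrame can be collected and mapped; a minimal sketch, assuming the final select(...) above is assigned to a val (here called resultDF, a hypothetical name) instead of calling .show() directly:

val resultList: List[Array[String]] = resultDF.collect().toList.map { row =>
  Array(
    row.getString(0),                                             // name
    row.getInt(1).toString,                                       // fromYear
    row.getInt(2).toString,                                       // toYear
    Option(row.getSeq[String](3)).getOrElse(Seq()).mkString("#"), // known people, e.g. "John#Andrew"
    Option(row.getSeq[String](4)).getOrElse(Seq()).mkString("#")  // unknown people, e.g. "Erik"
  )
}

The Option(...) guards cover the case where one of the left-outer joins produced a null list instead of an empty one, so names with no known (or no unknown) partners map to "".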