How to delete a specific part of a dataframe in Scala

Time: 2017-12-02 14:16:39

Tags: scala apache-spark spark-dataframe scala-collections

I have read a text file with Scala Spark and converted it into a dataframe table, and I have a question about it.
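For reference, a minimal sketch of one way that conversion could look (the file name and the split-on-"=" parsing here are only illustrative assumptions, not the actual code):

import org.apache.spark.sql.functions.{col, split}

// read the raw text file; spark.read.text yields a single string column named "value"
val raw = spark.read.text("input.txt")
// split each line on "=": the part before it becomes col1, the part after it col2;
// header lines like "FIRST:" and ";" contain no "=", so col2 and col3 stay null
val df = raw
  .withColumn("col1", split(col("value"), "=").getItem(0))
  .withColumn("col2", split(col("value"), "=").getItem(1))
  .withColumn("col3", split(col("value"), "=").getItem(2))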

The table looks like this:

+------------+----------+----+----+
|value       |col1      |col2|col3|
+------------+----------+----+----+
|FIRST:      |FIRST:    |null|null|
|erewwetrt=1 |erewwetrt |1   |null|
|wrtertret=2 |wrtertret |2   |null|
|ertertert=3 |ertertert |3   |null|
|;           |;         |null|null|
|FIRST:      |FIRST:    |null|null|
|asdafdfd=1  |asdafdfd  |1   |null|
|adadfadf=2  |adadfadf  |2   |null|
|adfdafdf=3  |adfdafdf  |3   |null|
|;           |;         |null|null|
|SECOND:     |SECOND:   |null|null|
|adfsfsdfgg=1|adfsfsdfgg|1   |null|
|sdfsdfdfg=2 |sdfsdfdfg |2   |null|
|sdfsdgsdg=3 |sdfsdgsdg |3   |null|
|;           |;         |null|null|
+------------+----------+----+----+

So the final dataframe table should look like this (it needs to contain only the FIRST sections...):

+------------+----------+----+----+
|value       |col1      |col2|col3|
+------------+----------+----+----+
|FIRST:      |FIRST:    |null|null|
|erewwetrt=1 |erewwetrt |1   |null|
|wrtertret=2 |wrtertret |2   |null|
|ertertert=3 |ertertert |3   |null|
|;           |;         |null|null|
|FIRST:      |FIRST:    |null|null|
|asdafdfd=1  |asdafdfd  |1   |null|
|adadfadf=2  |adadfadf  |2   |null|
|adfdafdf=3  |adfdafdf  |3   |null|
|;           |;         |null|null|
...

My question is: how do I delete the rows from SECOND: down to the ;?

How can I achieve this in Scala Spark?

3 Answers:

Answer 0: (score: 0)

So here is my quick and dirty solution (see the updated solution below):

// Let's define a sample DF (just like your DF)
import spark.implicits._   // needed for .toDF outside the spark-shell
val df = spark.sparkContext.parallelize(Array(("First",1),("First",2),("dummy",3),("Second",4))).toDF
// Get the index of the row where "Second" occurs
val idx = df.rdd.zipWithIndex.filter(x => x._1(0) == "Second").map(x => x._2).first
// keep only the rows before that index
val res = df.rdd.zipWithIndex.filter(x => x._2 < idx).map(x => x._1)
// and the result:
res.collect
// Array[org.apache.spark.sql.Row] = Array([First,1], [First,2], [dummy,3])

Oh, and if you want to convert it back to a DF, do the following:

val df_res = spark.createDataFrame(res,df.schema)

Updated solution: based on the additional input, I am updating my answer as follows. (My assumption is that "Second:....." occurs only once in the file. If it doesn't, by now you should know how to work around that.)

// new df for illustration
val df = spark.sparkContext.parallelize(Array(("First:",1),(";",2),("dummy",3),(";",4),("Second:",5),("some value",5),(";",6),("First:",7),(";",8))).toDF
// zip with index
val rdd = df.rdd.zipWithIndex
// this looks like:
rdd.collect
// res: Array[(org.apache.spark.sql.Row, Long)] = Array(([First:,1],0), ([;,2],1), ([dummy,3],2), ([;,4],3), ([Second:,5],4), ([some value,5],5), ([;,6],6), ([First:,7],7), ([;,8],8))
// find the relevant index locations for "Second:" and ";"
val idx_second: Long = rdd.filter(x => x._1(0) == "Second:").map(x => x._2).first
val idx_semic: Long = rdd.filter(x => x._1(0) == ";").filter(x => x._2 >= idx_second).map(x => x._2).first
// and here is the result
val result = rdd.filter(x => (x._2 < idx_second) || (x._2 > idx_semic))
// verify the result
result.collect
// res: Array[(org.apache.spark.sql.Row, Long)] = Array(([First:,1],0), ([;,2],1), ([dummy,3],2), ([;,4],3), ([First:,7],7), ([;,8],8))
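If "Second:" can actually occur more than once, a sketch along the same lines (reusing the rdd defined above, and assuming every "Second:" block is eventually closed by a later ";") would collect all block boundaries and drop every row that falls inside one of them:

// indices of all "Second:" headers and of all ";" terminators
val secondIdxs = rdd.filter(x => x._1(0) == "Second:").map(_._2).collect
val semicIdxs  = rdd.filter(x => x._1(0) == ";").map(_._2).collect
// pair each "Second:" with the first ";" at or after it
val ranges = secondIdxs.map(s => (s, semicIdxs.filter(_ >= s).min))
// keep only the rows whose index lies outside every such range
val kept = rdd.filter { case (_, i) => !ranges.exists { case (lo, hi) => i >= lo && i <= hi } }.map(_._1)

As before, kept can be turned back into a dataframe with spark.createDataFrame(kept, df.schema).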

Answer 1: (score: 0)

Create the dataframe mentioned by @datamannz:

import spark.implicits._                        // for .toDF outside the spark-shell
import org.apache.spark.sql.functions.col

val df = spark.sparkContext.parallelize(Array(("First",1),("First",2),("dummy",3),("Second",4))).toDF("value", "col1")

df.filter(col("value").notEqual("Second")).show
+-----+----+
|value|col1|
+-----+----+
|First|   1|
|First|   2|
|dummy|   3|
+-----+----+

Answer 2: (score: 0)

The answer is as follows: file.filter(col(x).notEqual(y)).aggregate()
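A minimal concrete version of that line against the example table might look like the following; "col1" and "SECOND:" are stand-ins for the unspecified x and y, file stands for the dataframe built from the text file, and the trailing aggregate() is left out because no aggregation is defined in the answer:

import org.apache.spark.sql.functions.col

// removes only the "SECOND:" header row itself; the body rows of that section would
// still need separate handling (for example the index-based approach in answer 0)
val filtered = file.filter(col("col1").notEqual("SECOND:"))
filtered.show()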