I have read a text file and converted it into a DataFrame table using Scala Spark, and I have the following problem:
The table looks like this:
+------------+----------+----+----+
|value |col1 |col2|col3|
+------------+----------+----+----+
|FIRST: |FIRST: |null|null|
|erewwetrt=1 |erewwetrt |1 |null|
|wrtertret=2 |wrtertret |2 |null|
|ertertert=3 |ertertert |3 |null|
|; |; |null|null|
|FIRST: |FIRST: |null|null|
|asdafdfd=1 |asdafdfd |1 |null|
|adadfadf=2 |adadfadf |2 |null|
|adfdafdf=3 |adfdafdf |3 |null|
|; |; |null|null|
|SECOND: |SECOND: |null|null|
|adfsfsdfgg=1|adfsfsdfgg|1 |null|
|sdfsdfdfg=2 |sdfsdfdfg |2 |null|
|sdfsdgsdg=3 |sdfsdgsdg |3 |null|
|; |; |null|null|
So the final DataFrame should look like this (it needs to contain only the FIRST sections...):
+------------+----------+----+----+
|value |col1 |col2|col3|
+------------+----------+----+----+
|FIRST: |FIRST: |null|null|
|erewwetrt=1 |erewwetrt |1 |null|
|wrtertret=2 |wrtertret |2 |null|
|ertertert=3 |ertertert |3 |null|
|; |; |null|null|
|FIRST: |FIRST: |null|null|
|asdafdfd=1 |asdafdfd |1 |null|
|adadfadf=2 |adadfadf |2 |null|
|adfdafdf=3 |adfdafdf |3 |null|
|; |; |null|null|
...
My question is: how do I remove the rows from SECOND up to (and including) the ";" that follows it?
How can I do this in Scala Spark?
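For reference, this is roughly how a DataFrame of that shape can be built from the text file (the path and the split on "=" are placeholders for illustration, not my exact code):
import org.apache.spark.sql.functions._
// hypothetical input path; each line of the file becomes one row in a single "value" column
val raw = spark.read.text("/path/to/input.txt")
// assumed parsing: split "key=value" lines into col1/col2; header and ";" lines get null in col2
val df = raw
  .withColumn("col1", split(col("value"), "=").getItem(0))
  .withColumn("col2", split(col("value"), "=").getItem(1))
  .withColumn("col3", lit(null).cast("string"))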
Answer 0 (score: 0)
So here is my quick-and-dirty solution (see the updated solution below):
//Let's define a sample DF (just like your DF)
import spark.implicits._
val df = spark.sparkContext.parallelize(Array(("First",1),("First",2),("dummy",3),("Second",4))).toDF
//Get the index of the row where "Second" occurs
val idx = df.rdd.zipWithIndex.filter(x => x._1(0) == "Second").map(x => x._2).first
//Keep only the rows that come before that index
val res = df.rdd.zipWithIndex.filter(x => x._2 < idx).map(x => x._1)
//and the result:
res.collect
//Array[org.apache.spark.sql.Row] = Array([First,1], [First,2], [dummy,3])
Oh, and if you want to convert it back to a DataFrame, do the following:
val df_res = spark.createDataFrame(res,df.schema)
Updated solution: based on the additional input, I am updating my answer as follows. (My assumption is that "Second:....." occurs only once in the file. If it doesn't, by now you should know how to work through that.)
//new df for illustration
val df = spark.sparkContext.parallelize(Array(("First:",1),(";",2),("dummy",3),(";",4),("Second:",5),("some value",5), (";",6),("First:",7),(";",8) )).toDF
//zip with index
val rdd = df.rdd.zipWithIndex
//this looks like:
rdd.collect
//res: Array[(org.apache.spark.sql.Row, Long)] = Array(([First:,1],0), ([;,2],1), ([dummy,3],2), ([;,4],3), ([Second:,5],4), ([some value,5],5), ([;,6],6), ([First:,7],7), ([;,8],8))
// find the index of "Second:" and of the first ";" that comes after it
val idx_second: Long = rdd.filter(x => x._1(0) == "Second:").map(x => x._2).first
val idx_semic: Long = rdd.filter(x => x._1(0) == ";").filter(x => x._2 >= idx_second).map(x => x._2).first
// and here is the result: keep everything before "Second:" and everything after its closing ";"
val result = rdd.filter(x => (x._2 < idx_second) || (x._2 > idx_semic))
// verify the result
result.collect
// res: Array[(org.apache.spark.sql.Row, Long)] = Array(([First:,1],0), ([;,2],1), ([dummy,3],2), ([;,4],3), ([First:,7],7), ([;,8],8))
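To turn the filtered result back into a DataFrame, the same trick as above works (a short sketch: drop the index and reuse the original schema):
// drop the zipWithIndex index and rebuild a DataFrame with the original schema
val df_result = spark.createDataFrame(result.map(_._1), df.schema)
df_result.show()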
Answer 1 (score: 0)
Create the DataFrame as @datamannz mentioned:
import org.apache.spark.sql.functions.col
val df = spark.sparkContext.parallelize(Array(("First",1),("First",2),("dummy",3),("Second",4))).toDF("value", "col1")
df.filter(col("value").notEqual("Second")).show
+-----+----+
|value|col1|
+-----+----+
|First| 1|
|First| 2|
|dummy| 3|
+-----+----+
Answer 2 (score: 0)
The answer is as follows:
file.filter(col(x).notEqual(y)).aggregate()
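A concrete reading of that fragment, assuming "file" is the question's DataFrame and x/y stand for the column name and the value to filter out (a sketch of one interpretation, not the answerer's exact code):
import org.apache.spark.sql.functions.col
// assumed interpretation: keep every row whose col1 is not "SECOND:"
val filtered = df.filter(col("col1").notEqual("SECOND:"))
filtered.show()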