Question

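I have a text file in Hadoop, and I need to sort it by its second column using the Spark Java API. I am using a DataFrame, but I am not sure about its columns. The file may contain dynamic columns, meaning I do not know the exact number of columns.

What should I do? Please help.

Thanks in advance.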

Answer 1

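First of all, I am giving a CSV example in Scala (not Java).

You can use the Spark CSV API to create a DataFrame and sort by any column you need. If you have constraints, see the approaches below.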
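
As a minimal sketch of the CSV-reader route, assuming Spark 2.x or later (where `SparkSession` and `spark.read.csv` are available, unlike the `SQLContext` used in the rest of this answer): when no header or schema is supplied, the reader names the columns `_c0`, `_c1`, ... and types them as strings, so the second column can be picked by position without knowing how many columns there are. The object name `SortBySecondColumn`, the helper `sortBySecondColumn`, and the local master setting are illustrative assumptions, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object SortBySecondColumn {
  // Hypothetical helper: reads a headerless CSV and sorts it by its second column.
  def sortBySecondColumn(spark: SparkSession, path: String): DataFrame = {
    // Without a header or schema, columns are named _c0, _c1, ... and read as strings.
    val df = spark.read.csv(path)

    // Look the second column up by position, so the total column count does not matter.
    val secondColumn = df.columns(1)

    // This is a string sort; use col(secondColumn).cast("double") for a numeric sort.
    df.sort(col(secondColumn))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SortBySecondColumn")
      .master("local[*]") // assumption: local run; drop this when submitting to a cluster
      .getOrCreate()

    sortBySecondColumn(spark, "ebay.csv").show()
    spark.stop()
  }
}
```

The same calls (`spark.read().csv(...)`, `df.columns()`, `df.sort(...)`) exist in the Java API, so the sketch should carry over to Java with only syntactic changes.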

Fixed number of columns:

Start with the following fixed-column example; you can use it as a template.

where the data in ebay.csv looks like:

"8213034705,95,2.927373,jake7870,0,95,117.5,xbox,3"

//  SQLContext entry point for working with structured data
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// Import Spark SQL data types and Row.
import org.apache.spark.sql._

//define the schema using a case class
case class Auction(auctionid: String, bid: Float, bidtime: Float, bidder: String, bidderrate: Integer, openbid: Float, price: Float, item: String, daystolive: Integer)


val auction = sc.textFile("ebay.csv").map(_.split(",")).map(p =>
  Auction(p(0), p(1).toFloat, p(2).toFloat, p(3), p(4).toInt, p(5).toFloat, p(6).toFloat, p(7), p(8).toInt)).toDF()

// Display the top 20 rows of DataFrame 
auction.show()
// auctionid  bid   bidtime  bidder         bidderrate openbid price item daystolive
// 8213034705 95.0  2.927373 jake7870       0          95.0    117.5 xbox 3
// 8213034705 115.0 2.943484 davidbresler2  1          95.0    117.5 xbox 3 …


// Return the schema of this DataFrame
auction.printSchema()
root
 |-- auctionid: string (nullable = true)
 |-- bid: float (nullable = false)
 |-- bidtime: float (nullable = false)
 |-- bidder: string (nullable = true)
 |-- bidderrate: integer (nullable = true)
 |-- openbid: float (nullable = false)
 |-- price: float (nullable = false)
 |-- item: string (nullable = true)
 |-- daystolive: integer (nullable = true)

auction.sort("auctionid") // sorts by the first column, auctionid; use sort("bid") to sort by the second column

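Variable number of columns (possible because a case class can take an Array parameter):

You can use a sketch like the following, where the first 4 elements are fixed and everything after them goes into a variable-length array.

Since you only need to sort on the second column, this is enough; all the other data stays in that row's array for later use.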

// The first 4 fields are fixed; any remaining fields are collected into a Seq
// (a Map or another complex type would also work here)
case class Auction(auctionid: String, bid: Float, bidtime: Float, bidder: String, variableNumberOfColumns: Seq[String])

val auction = sc.textFile("ebay.csv").map(_.split(",")).map(p =>
  Auction(p(0), p(1).toFloat, p(2).toFloat, p(3), p.drop(4).toSeq)).toDF()

auction.sort("auctionid") // sorts by the first column, auctionid; use sort("bid") to sort by the second column
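
If even the leading columns are not known in advance, a different technique from the case-class approach above is to build the schema programmatically. The sketch below, assuming Spark 2.x or later, a comma-delimited file, and a non-empty input, finds the widest row, pads shorter rows with nulls, and names the columns `c0`, `c1`, ... The object name `DynamicColumns`, the helper `load`, and the column names are hypothetical choices made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object DynamicColumns {
  // Hypothetical helper: loads a delimited text file whose rows may differ in length.
  def load(spark: SparkSession, path: String): DataFrame = {
    // A limit of -1 keeps trailing empty fields instead of dropping them.
    val rows = spark.sparkContext.textFile(path).map(_.split(",", -1))

    // The widest row decides how many columns the schema gets (fails on an empty file).
    val width = rows.map(_.length).max()

    // Every column is a nullable string named c0, c1, ...
    val schema = StructType(
      (0 until width).map(i => StructField(s"c$i", StringType, nullable = true)))

    // Pad shorter rows with nulls so every Row matches the schema.
    val padded = rows.map(fields => Row.fromSeq(fields.toSeq.padTo(width, null: String)))

    spark.createDataFrame(padded, schema)
  }
}
```

With the DataFrame built this way, `df.sort(df.columns(1))` sorts on the second column as a string; casting that column first would give a numeric sort.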
