I am trying to create batches of rows of a Dataset in Spark.
To cap the number of records sent to a service, I want to batch the items so that I can control the rate at which the data is sent.
With the case classes
case class Person(name:String, address: String)
case class PersonBatch(personBatch: List[Person])
I want to create a Dataset[PersonBatch] from a given Dataset[Person].
For example, if the input Dataset[Person] has 100 records, the output should be a Dataset[PersonBatch] in which each PersonBatch is a list of n Person records.
I have tried the following, but it does not work.
object DataBatcher extends Logger {

  var batchList: ListBuffer[PersonBatch] = ListBuffer[PersonBatch]()
  var batchSize: Long = 500 // default batch size

  def addToBatchList(batch: PersonBatch): Unit = {
    batchList += batch
  }

  def clearBatchList(): Unit = {
    batchList.clear()
  }

  def createBatches(ds: Dataset[Person]): Dataset[PersonBatch] = {
    val dsCount = ds.count()
    logger.info(s"Count of dataset passed for creating batches : ${dsCount}")
    val batchElement = ListBuffer[Person]()
    val batch = PersonBatch(batchElement)
    ds.foreach(x => {
      batch.personBatch += x
      if (batch.personBatch.length == batchSize) {
        addToBatchList(batch)
        batch.requestBatch.clear()
      }
    })
    if (batch.personBatch.length > 0) {
      addToBatchList(batch)
      batch.personBatch.clear()
    }
    sparkSession.createDataset(batchList)
  }
}
I want to run this job on a Hadoop cluster. Can someone help me with this?
Answer 0 (score: 1)
rdd.iterator has a grouped function that might be useful for you.
For example:
iter.grouped(BATCHSIZE)
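Applied to the question's Person / PersonBatch types, a minimal sketch of that idea (my own, not from the answer, assuming a batchSize parameter and an implicit SparkSession for the encoder) could look like this:

import org.apache.spark.sql.{Dataset, SparkSession}

def createBatches(ds: Dataset[Person], batchSize: Int)(implicit spark: SparkSession): Dataset[PersonBatch] = {
  import spark.implicits._ // provides the Encoder[PersonBatch]
  // mapPartitions runs on the executors, so no mutable driver-side state is needed
  ds.mapPartitions { iter =>
    iter.grouped(batchSize).map(group => PersonBatch(group.toList))
  }
}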
Here is a sample code snippet that uses iter.grouped(batchsize), here 1000, to do batch inserts into a database:
import java.sql.{Connection, DriverManager}
import org.apache.spark.sql.{DataFrame, Row}

// numPartitions ~ number of simultaneous DB connections you are planning to allow
val repartitionedDf = df.repartition(numofpartitionsyouwant)

def insertToTable(dataFrame: DataFrame,
                  sqlDatabaseConnectionString: String,
                  sqlTableName: String): Unit = {
  val tableHeader: String = dataFrame.columns.mkString(",")
  dataFrame.foreachPartition { partition: Iterator[Row] =>
    // NOTE: one connection per partition (using a connection pool would be better)
    val sqlExecutorConnection: Connection =
      DriverManager.getConnection(sqlDatabaseConnectionString)
    // A batch size of 1000 is used because some databases, e.g. Azure SQL,
    // cannot take more than 1000 rows per INSERT statement
    partition.grouped(1000).foreach { group =>
      val insertString: scala.collection.mutable.StringBuilder =
        new scala.collection.mutable.StringBuilder()
      group.foreach { record =>
        insertString.append("('" + record.mkString(",") + "'),")
      }
      sqlExecutorConnection
        .createStatement()
        .executeUpdate(f"INSERT INTO [$sqlTableName] ($tableHeader) VALUES "
          + insertString.stripSuffix(","))
    }
    sqlExecutorConnection.close() // close the connection so that connections won't be exhausted
  }
}
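A hypothetical call, with a placeholder connection string, partition count, and table name (none of these come from the answer), might look like:

insertToTable(
  df.repartition(8), // ~8 simultaneous DB connections
  "jdbc:sqlserver://<host>:1433;database=<db>;user=<user>;password=<password>",
  "PersonTable")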
Answer 1 (score: 0)
import java.sql.{DriverManager, SQLException, SQLIntegrityConstraintViolationException}
import java.util
import org.apache.spark.sql.Row

val tableHeader: String = dataFrame.columns.mkString(",")

dataFrame.foreachPartition((it: Iterator[Row]) => {
  println("partition index: ")
  val url = "jdbc:..." + "user=;password=;" // fill in the JDBC URL and credentials
  val conn = DriverManager.getConnection(url)
  conn.setAutoCommit(false) // we commit manually after each batch
  val stmt = conn.createStatement()
  val batchSize = 10
  var i = 0
  while (it.hasNext) {
    val row = it.next
    try {
      stmt.addBatch(" UPDATE TABLE SET STATUS = 0 , " +
        " DATE ='" + new java.sql.Timestamp(System.currentTimeMillis()) + "'" +
        " where id = " + row.getAs("IDNUM"))
      i += 1
      if (i % batchSize == 0) {
        stmt.executeBatch
        conn.commit
      }
    } catch {
      case e: SQLIntegrityConstraintViolationException => // ignore constraint violations
      case e: SQLException =>
        e.printStackTrace()
    } finally {
      stmt.executeBatch
      conn.commit
    }
  }
  val ret = stmt.executeBatch
  System.out.println("Ret val: " + util.Arrays.toString(ret))
  System.out.println("Update count: " + stmt.getUpdateCount)
  conn.commit
  stmt.close
  conn.close
})