This example is easy enough to understand:
val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
def myfunc(index: Int, iter: Iterator[Int]) : Iterator[String] = {
  iter.map(x => index + "," + (x, x, x+100))
}
rdd1.mapPartitionsWithIndex(myfunc).collect()
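I have been trying to fetch some data through JDBC calls inside mapPartitions, to allow for some basic parallel processing. The example I came up with is admittedly not a realistic one, but for the sake of argument, imagine a JDBC source and some complex logic that does not suit DataFrames and is easier to handle with RDDs. Please bear with me.
So I have mocked up some calls but, unlike the example above, I am not sure how to return the values read from the database through the Iterator[Any] return type. That is my question.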
import java.sql.DriverManager
import java.util.Properties
val rdd1 = sc.parallelize(List("G%", "C%", "I%", "B%", "X%", "F%", "J%"), 3)
def myfunc(index: Int, iter: Iterator[String]) : Iterator[Any] = {
  val jdbcHostname = "mysql-rfam-public.ebi.ac.uk"
  val jdbcPort = 4497
  val jdbcDatabase = "Rfam"
  val jdbcUrl = s"jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}"
  val jdbcUsername = "rfamro"
  val jdbcPassword = ""
  val connectionProperties = new Properties()
  connectionProperties.put("user", s"${jdbcUsername}")
  connectionProperties.put("password", s"${jdbcPassword}")
  val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
  iter.map { x => val val1 = x;
    val statement = connection.createStatement()
    val resultSet = statement.executeQuery(s"""(select DISTINCT type from family where type like '${val1}' ) """)
    while ( resultSet.next() ) {
      val hInType = resultSet.getString("type")
    }
  }
}
rdd1.mapPartitionsWithIndex(myfunc).collect()
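I get empty data back, which I understand, but I am not sure whether what I want is possible or how to amend the approach. I was thinking of preserving the partition index, as the first example does.
The approach below works fine, of course, but it is easy to understand, even for me!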
iter.map(x => index + "," + (x, x, x+100))
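A likely reason the JDBC version yields nothing useful is that its lambda ends with a while loop, and a while loop evaluates to Unit, so every input element maps to (). The sketch below contrasts that with a lambda whose last expression is a value carrying the partition index. It runs without Spark or a database: `fakeRows`, `loopOnly` and `withValue` are hypothetical names, and the row values are made up for illustration.

```scala
import scala.collection.mutable.ListBuffer

// Made-up stand-in for the rows a query would return; no database is involved.
val fakeRows = Map("G%" -> List("Gene", "Gamma"), "C%" -> List("Cis"))

// Mirrors the question's lambda: the while loop is the last expression,
// so each element is mapped to the Unit value ().
def loopOnly(index: Int, iter: Iterator[String]): Iterator[Any] =
  iter.map { pattern =>
    val rows = fakeRows.getOrElse(pattern, Nil).iterator
    while (rows.hasNext) {
      val hInType = rows.next() // read and then discarded
    }
  }

// Same loop, but the rows are collected and the lambda ends with a value
// that keeps the partition index.
def withValue(index: Int, iter: Iterator[String]): Iterator[String] =
  iter.map { pattern =>
    val rows = fakeRows.getOrElse(pattern, Nil).iterator
    val found = ListBuffer[String]()
    while (rows.hasNext) found += rows.next()
    s"$index,${(pattern, found.toList)}"
  }

val unitResult = loopOnly(0, Iterator("G%", "C%")).toList   // a list of two () values
val valueResult = withValue(0, Iterator("G%", "C%")).toList // strings prefixed with the index
```

The same functions have the `(Int, Iterator[String]) => Iterator[...]` shape that mapPartitionsWithIndex expects, so the idea should carry over once the fake rows are replaced by a real result set.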
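So I tried the following, but I always get empty output. I suspect that what I am attempting may simply not work. I get the impression that the compiler thinks it can go straight to the last statement. Is that really so? I was also assuming that only one connection is made per partition, but now I am not sure.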
...
var fruits = new ListBuffer[String]()
iter.map { x => val val1 = x;
  println (x)
  val statement = connection.createStatement()
  val resultSet = statement.executeQuery(s"""(select DISTINCT type from family where type like '${val1}' ) """)
  while ( resultSet.next() ) {
    val hInType = resultSet.getString("type")
    fruits += hInType
  }
}
return fruits.toList.toIterator
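The empty result here is probably explained by laziness rather than by the compiler skipping code: `map` on a Scala Iterator only records the function, and nothing runs until the iterator is consumed. Since the mapped iterator above is never consumed, the buffer is still empty when it is returned. A minimal plain-Scala sketch of that behaviour, with hypothetical names `mapped`, `beforeConsume` and `afterConsume`:

```scala
import scala.collection.mutable.ListBuffer

val fruits = ListBuffer[String]()

// map on an Iterator is lazy: the side effect below has not run yet.
val mapped = Iterator("G%", "C%").map { x =>
  fruits += x.toLowerCase
  x
}

val beforeConsume = fruits.toList // still empty, nothing has been evaluated
val consumed = mapped.toList      // forces the iterator, which runs the side effects
val afterConsume = fruits.toList  // now holds one entry per input element
```

Forcing the mapped iterator (for example with `toList`) before reading the buffer, or returning the mapped iterator itself, avoids the problem.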
Answer 0 (score: 0)
This works, but it is an entirely different approach; I cannot say for sure what is wrong with the attempts above.
import java.util.Properties
import scala.collection.mutable.ListBuffer
import java.sql.{Connection, Driver, DriverManager, JDBCType, PreparedStatement, ResultSet, ResultSetMetaData, SQLException}
// Looks up the MAX(type) value in the family table for the LIKE pattern in `record`.
def readMatchingFromDB(record: String, connection: Connection) : String = {
  var hInType: String = "XXX"
  val val1 = record
  val statement = connection.createStatement()
  // MAX(...) needs the "as type" alias so that getString("type") below works
  val resultSet = statement.executeQuery(s"""(select MAX(type) as type from family where type like '${val1}' ) """)
  while ( resultSet.next() ) {
    hInType = resultSet.getString("type")
  }
  statement.close() // also closes the result set
  return hInType // Only returning 1 due to MAX
}
val rdd1 = sc.parallelize(List("G%", "C%", "I%", "B%", "X%", "F%", "J%"), 3)
val newRdd = rdd1.mapPartitions(
  partition => {
    // This function runs once per partition, so one connection is opened per partition
    val jdbcHostname = "mysql-rfam-public.ebi.ac.uk"
    val jdbcPort = 4497
    val jdbcDatabase = "Rfam"
    val jdbcUrl = s"jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}"
    val jdbcUsername = "rfamro"
    val jdbcPassword = ""
    val connectionProperties = new Properties()
    connectionProperties.put("user", s"${jdbcUsername}")
    connectionProperties.put("password", s"${jdbcPassword}")
    val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
    val newPartition = partition.map(
      record => {
        readMatchingFromDB(record, connection)
      }).toList // toList forces the lazy iterator while the connection is still open
    connection.close()
    newPartition.toIterator
  }).collect
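If all matching rows and the partition index are wanted, rather than a single MAX value, one possible variation is to flatMap over the records and materialize the results before closing the connection. The sketch below is only an illustration under stated assumptions: `FakeConnection` is a hypothetical in-memory stand-in for the JDBC connection, its rows are made up, and `lookupPartition` is a name introduced here, so the code runs without Spark or a database.

```scala
// Hypothetical stand-in for a JDBC connection; the rows are made up.
class FakeConnection {
  var closed = false
  private val table = List("Gene", "Gamma", "Cis", "Intron")

  // Emulates "type like 'G%'" with a simple prefix match.
  def query(likePattern: String): List[String] = {
    require(!closed, "connection already closed")
    table.filter(_.startsWith(likePattern.stripSuffix("%")))
  }

  def close(): Unit = { closed = true }
}

// Same shape as the function passed to mapPartitionsWithIndex:
// (partition index, records of the partition) => results.
def lookupPartition(index: Int, iter: Iterator[String]): Iterator[(Int, String, String)] = {
  val connection = new FakeConnection // one connection per call, i.e. per partition
  val results = iter.flatMap { pattern =>
    connection.query(pattern).map(t => (index, pattern, t))
  }.toList // force evaluation while the connection is still open
  connection.close()
  results.iterator
}

val partition0 = lookupPartition(0, Iterator("G%", "X%")).toList
```

With a real connection in place of `FakeConnection`, a function of this shape could be passed as `rdd1.mapPartitionsWithIndex(lookupPartition)`. Materializing with `toList` keeps the whole partition's results in memory, which is the trade-off for being able to close the connection safely.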