Via mapPartitions

Time: 2018-05-30 07:56:12

Tags: apache-spark

While this example is easy to understand:

val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
def myfunc(index: Int, iter: Iterator[Int]) : Iterator[String] = {
    iter.map(x => index + "," + (x, x, x+100))
}
rdd1.mapPartitionsWithIndex(myfunc).collect() 
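To make the shape of that function concrete, here is a minimal sketch in plain Scala with no Spark involved. The name `tagWithIndex` and the hand-made "partition" `Iterator(1, 2, 3)` are illustrative assumptions; they do not reproduce how Spark actually splits the list across partitions.

```scala
// Same shape as myfunc above: (partition index, partition elements) => new elements.
def tagWithIndex(index: Int, iter: Iterator[Int]): Iterator[String] =
  iter.map(x => s"$index,${(x, x, x + 100)}")

// Hand-made stand-in for one partition, consumed eagerly with toList.
val tagged = tagWithIndex(0, Iterator(1, 2, 3)).toList
// tagged == List("0,(1,1,101)", "0,(2,2,102)", "0,(3,3,103)")
```

Each element is mapped to exactly one output string, and the value of the lambda is what ends up in the resulting iterator.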

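I have been trying to fetch some data through JDBC calls inside mapPartitions, to allow some basic parallel processing. The example I came up with is admittedly not a realistic one, but for the sake of argument imagine there is some JDBC source and, say, some complex logic that does not suit DataFrames and is easier to handle with RDDs. Please bear with me.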

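So I have mocked up some calls, but in contrast to the example above, I am not sure how to return the values read from the database through the Iterator[Any] return type. That is my question.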

import java.sql.DriverManager
import java.util.Properties

val rdd1 = sc.parallelize(List("G%", "C%", "I%", "B%", "X%", "F%", "J%"), 3)

def myfunc(index: Int, iter: Iterator[String]) : Iterator[Any] = {

    val jdbcHostname = "mysql-rfam-public.ebi.ac.uk"
    val jdbcPort = 4497
    val jdbcDatabase = "Rfam"
    val jdbcUrl = s"jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}"
    val jdbcUsername = "rfamro"
    val jdbcPassword = ""
    val connectionProperties = new Properties()
    connectionProperties.put("user", s"${jdbcUsername}")
    connectionProperties.put("password", s"${jdbcPassword}")
    val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)

    iter.map { x => val val1 = x; 
                val statement = connection.createStatement()
                val resultSet = statement.executeQuery(s"""(select DISTINCT type from family where type like '${val1}' ) """)
                while ( resultSet.next() ) {
                        val hInType = resultSet.getString("type")
                } 
             }
}

rdd1.mapPartitionsWithIndex(myfunc).collect()

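I get empty data back, which I understand, but I am not sure what exactly I want or how to change the approach. I was thinking of keeping the partition index. For example.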

The line below of course works fine and is easy to follow, even for me!

    iter.map(x => index + "," + (x, x, x+100))
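One likely reason the JDBC version returns empty values while the line above does not: in Scala the value of a block is its last expression, and a `while` loop has type `Unit`. The sketch below shows the difference in plain Scala; `lastIsWhile` and `lastIsValue` are hypothetical names and the counting loop merely stands in for the `resultSet.next()` loop.

```scala
// A block's value is its last expression; a while loop has type Unit.
def lastIsWhile(iter: Iterator[Int]): Iterator[Any] =
  iter.map { x =>
    var i = 0
    while (i < x) { i += 1 } // work happens, but nothing is returned
  }

// Ending the block with a value makes that value the mapped element.
def lastIsValue(iter: Iterator[Int]): Iterator[Any] =
  iter.map { x =>
    var i = 0
    while (i < x) { i += 1 }
    i
  }

val units  = lastIsWhile(Iterator(1, 2, 3)).toList // List((), (), ())
val values = lastIsValue(Iterator(1, 2, 3)).toList // List(1, 2, 3)
```

Because `Iterator[Unit]` is also an `Iterator[Any]`, the first version compiles without complaint, which is probably why the problem only shows up in the collected output.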

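So I tried this, but I always get null output. I suspect that what I am attempting may simply not work. I get the impression that the compiler thinks it can go straight to the last statement. Is that true? I had also assumed that only one connection is made per partition, but now I am not sure.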

...
var fruits = new ListBuffer[String]()

iter.map { x => val val1 = x; 
                println (x)
                val statement = connection.createStatement()
                val resultSet = statement.executeQuery(s"""(select DISTINCT type from family where type like '${val1}' ) """)
                while ( resultSet.next() ) {
                        val hInType = resultSet.getString("type")
                        fruits += hInType

                } 
          }

return fruits.toList.toIterator 
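A plausible explanation for the empty result of this attempt is that `Iterator.map` is lazy: the mapped iterator is built but never consumed, so its body never runs and the buffer is still empty when it is converted to a list. The following plain-Scala sketch illustrates this under stated assumptions: `fakeQuery` is a hypothetical stand-in for the JDBC query, and `viaSideEffect` and `viaFlatMap` are illustrative names.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical stand-in for the JDBC query: returns the "rows" for one pattern.
def fakeQuery(pattern: String): List[String] =
  List(pattern + "1", pattern + "2")

// Mirrors the attempt above: the mapped iterator is never consumed,
// so the body never runs and the buffer is still empty when it is read.
def viaSideEffect(iter: Iterator[String]): Iterator[String] = {
  val rows = new ListBuffer[String]()
  iter.map { x => fakeQuery(x).foreach(rows += _) }
  rows.toList.iterator
}

// Returning the rows from the function and flattening avoids the side effect.
def viaFlatMap(iter: Iterator[String]): Iterator[String] =
  iter.flatMap(x => fakeQuery(x))

val lost = viaSideEffect(Iterator("G", "C")).toList // List()
val kept = viaFlatMap(Iterator("G", "C")).toList    // List("G1", "G2", "C1", "C2")
```

With `flatMap`, each input pattern can contribute zero, one, or many rows, which fits a `select DISTINCT` query better than a one-to-one `map`.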

1 answer:

Answer 0: (score: 0)

This works, but it is a completely different approach; I am not sure about the attempts above.

import java.util.Properties
import scala.collection.mutable.ListBuffer
import java.sql.{Connection, Driver, DriverManager, JDBCType, PreparedStatement, ResultSet, ResultSetMetaData, SQLException}

def readMatchingFromDB(record: String, connection: Connection) : String = {

    var hInType: String = "XXX" // placeholder initial value
    val val1 = record 
    val statement = connection.createStatement()
    // MAX(type) needs the "as type" alias so that getString("type") below can find the column
    val resultSet = statement.executeQuery(s"""(select MAX(type) as type from family where type like '${val1}' ) """)

    while ( resultSet.next() ) {
            hInType = resultSet.getString("type")                       
        }   
    statement.close() // closing the statement also closes its ResultSet
    hInType // only one value is returned, because of MAX
 }

val rdd1 = sc.parallelize(List("G%", "C%", "I%", "B%", "X%", "F%", "J%"), 3)
val newRdd = rdd1.mapPartitions(

      partition => {
         val jdbcHostname = "mysql-rfam-public.ebi.ac.uk"
         val jdbcPort = 4497
         val jdbcDatabase = "Rfam"
         val jdbcUrl = s"jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}"
         val jdbcUsername = "rfamro"
         val jdbcPassword = ""
         val connectionProperties = new Properties()
         connectionProperties.put("user", s"${jdbcUsername}")
         connectionProperties.put("password", s"${jdbcPassword}")
         val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)

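         // partition.map is lazy, so toList forces every query to run
         // while the connection is still open.
         val newPartition = partition.map(
           record => {  
                      readMatchingFromDB(record, connection)
                     }).toList

         connection.close()
         newPartition.toIterator  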
     }).collect
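The `.toList` before `connection.close()` is what makes this approach work. The sketch below shows why, using plain Scala and a hypothetical `FakeConnection` class that throws once it has been closed; the class and the two `processPartition*` functions are assumptions for illustration, not part of JDBC or Spark.

```scala
// Hypothetical stand-in for java.sql.Connection that fails once closed.
class FakeConnection {
  private var open = true
  def lookup(key: String): String = {
    if (!open) throw new IllegalStateException("connection closed")
    key.toUpperCase
  }
  def close(): Unit = { open = false }
}

// Same shape as the mapPartitions body above: open, map, materialize, close.
def processPartition(partition: Iterator[String]): Iterator[String] = {
  val connection = new FakeConnection
  try partition.map(connection.lookup).toList.iterator // toList runs every lookup now
  finally connection.close()
}

// Without toList the lookups run only when the caller consumes the iterator,
// which happens after close().
def processPartitionLazily(partition: Iterator[String]): Iterator[String] = {
  val connection = new FakeConnection
  val mapped = partition.map(connection.lookup)
  connection.close()
  mapped
}

val eager = processPartition(Iterator("a", "b")).toList // List("A", "B")
val lazyFailure = scala.util.Try(processPartitionLazily(Iterator("a", "b")).toList)
// lazyFailure is a Failure wrapping the IllegalStateException
```

The trade-off is that `toList` holds the whole transformed partition in memory at once, so this pattern suits partitions of moderate size.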