I want to use an org.apache.flink.api.scala.DataSet object multiple times:
For every action, Flink completely recomputes the value of the DataSet instead of caching it. I cannot find any cache() or persist() function like in Spark.
This really has a huge impact on my application with ~1,000,000 records and many joins / coGroup usages etc.: the runtime appears to triple, which amounts to several hours! So how can I cache or persist a DataSet and significantly reduce the runtime?
I am using the latest Flink release, 1.3.2, with Scala 2.11.
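For comparison, this is a minimal sketch of the Spark behaviour I mean (object name and sample values are illustrative): the result of the expensive operation is materialized once and then reused by later actions.

    import org.apache.spark.{SparkConf, SparkContext}

    object SparkCacheDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("cache-demo").setMaster("local[*]"))
        val data = sc.parallelize(Seq((1, 436), (2, 235), (3, 67)))

        // cache() keeps the cross product in memory after the first action
        val joined = data.cartesian(data).cache()

        println(joined.count()) // first action: computes and caches the result
        println(joined.count()) // second action: served from the cache
        sc.stop()
      }
    }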
Example:
package dummy

import org.apache.flink.api.scala._
import org.apache.flink.graph.scala.Graph
import org.apache.flink.graph.{Edge, Vertex}
import org.apache.logging.log4j.scala.Logging

object Trials extends Logging {

  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // some dataset which could be huge in reality
    val dataSet = env.fromElements((1, 436), (2, 235), (3, 67), (4, 51), (5, 15), (6, 62), (7, 155))

    // some complex joins, coGroup functions etc.
    val joined = dataSet.cross(dataSet).filter(tuple => (tuple._1._2 + tuple._2._2) % 7 == 0)

    // log the number of rows --> performs the join above
    logger.info(f"results contains ${joined.count()} rows")

    // convert to Gelly graph format
    val graph = Graph.fromDataSet(
      dataSet.map(nodeTuple => new Vertex[Long, Long](nodeTuple._1, nodeTuple._2)),
      joined.map(edgeTuple => new Edge[Long, String](edgeTuple._1._1, edgeTuple._2._1, "someValue")),
      env
    )

    // do something with the graph
    logger.info("get number of vertices")
    val numberOfVertices = graph.numberOfVertices()
    logger.info("get number of edges")
    val numberOfEdges = graph.numberOfEdges() // --> performs the join again!
    logger.info(f"the graph has ${numberOfVertices} vertices and ${numberOfEdges} edges")
  }
}
Required libs: log4j-core, log4j-api-scala_2.11, flink-core, flink-scala_2.11, flink-gelly-scala_2.10
Answer 0 (score: 0)
I think that if you need to perform several operations on the same stream, it is worth using side outputs - https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/side_output.html.
Once you have performed the complex joins, coGroup functions etc. and obtained the joined dataset, you can collect the values into different side outputs - one that later computes the count, another that does the remaining work. A sketch of this idea follows below.
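A minimal sketch of the side-output mechanism, assuming Flink 1.3+ (the OutputTag name, the tuple type and the sample elements are illustrative, not from the post): each record is emitted once to the main output and once to a side output, so both downstream consumers read the same records without recomputing the upstream operations.

    import org.apache.flink.streaming.api.functions.ProcessFunction
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.util.Collector

    object SideOutputSketch {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // stand-in for the expensively computed records
        val records: DataStream[(Int, Int)] = env.fromElements((1, 436), (2, 235), (3, 67))

        // tag identifying the side output for the second consumer
        val sideTag = OutputTag[(Int, Int)]("second-consumer")

        // emit every record to the main output and to the side output
        val mainStream = records.process(new ProcessFunction[(Int, Int), (Int, Int)] {
          override def processElement(value: (Int, Int),
                                      ctx: ProcessFunction[(Int, Int), (Int, Int)]#Context,
                                      out: Collector[(Int, Int)]): Unit = {
            out.collect(value)          // e.g. feeds the count
            ctx.output(sideTag, value)  // e.g. feeds the graph construction
          }
        })

        mainStream.print()                        // first consumer
        mainStream.getSideOutput(sideTag).print() // second consumer, same records
        env.execute("side output sketch")
      }
    }

Note that side outputs are a DataStream API feature, while the code in the question uses the batch DataSet API; there, a common workaround for avoiding recomputation is to write the expensive intermediate result to a sink and read it back.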