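I have been trying to integrate Ignite and Spark. The goal of my application is to write Spark DataFrames to Ignite and read them back. However, I am running into several problems with larger datasets (> 200 million rows).
I have a 6-node Ignite cluster running on YARN. It has 160 GB of memory and 12 cores. I am trying to use Spark to save a DataFrame (about 20 GB of raw text data) to an Ignite cache (partitioned, 1 backup):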
def main(args: Array[String]): Unit = {
  val ignite = setupIgnite
  closeAfter(ignite) { _ ⇒
    implicit val spark: SparkSession = SparkSession.builder
      .appName("Ignite Benchmark")
      .getOrCreate()

    // Read the pipe-delimited source tables from HDFS.
    val customer = readDF("csv", "|", Schemas.customerSchema, "hdfs://master.local:8020/apps/hive/warehouse/ssbplus100/customer")
    val part = readDF("csv", "|", Schemas.partSchema, "hdfs://master.local:8020/apps/hive/warehouse/ssbplus100/part")
    val supplier = readDF("csv", "|", Schemas.supplierSchema, "hdfs://master.local:8020/apps/hive/warehouse/ssbplus100/supplier")
    val dateDim = readDF("csv", "|", Schemas.dateDimSchema, "hdfs://master.local:8020/apps/hive/warehouse/ssbplus100/date_dim")
    val lineorder = readDF("csv", "|", Schemas.lineorderSchema, "hdfs://master.local:8020/apps/hive/warehouse/ssbplus100/lineorder")

    // Write the dimension tables as replicated caches and the fact table without backups.
    writeDF(customer, "customer", List("custkey"), TEMPLATES.REPLICATED)
    writeDF(part, "part", List("partkey"), TEMPLATES.REPLICATED)
    writeDF(supplier, "supplier", List("suppkey"), TEMPLATES.REPLICATED)
    writeDF(dateDim, "date_dim", List("datekey"), TEMPLATES.REPLICATED)
    writeDF(lineorder.limit(200000000), "lineorder", List("orderkey, linenumber"), TEMPLATES.NO_BACKUP)
  }
}
At some point, the Spark application fails with this error:
class org.apache.ignite.internal.mem.IgniteOutOfMemoryException: Out of memory in data region [name=default, initSize=256.0 MiB, maxSize=12.6 GiB, persistenceEnabled=false] Try the following:
^-- Increase maximum off-heap memory size (DataRegionConfiguration.maxSize)
^-- Enable Ignite persistence (DataRegionConfiguration.persistenceEnabled)
^-- Enable eviction or expiration policies
at org.apache.ignite.internal.pagemem.impl.PageMemoryNoStoreImpl.allocatePage(PageMemoryNoStoreImpl.java:304)
at org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList.allocateDataPage(AbstractFreeList.java:463)
at org.apache.ignite.internal.processors.cache.persistence.freelist.AbstractFreeList.insertDataRow(AbstractFreeList.java:501)
at org.apache.ignite.internal.processors.cache.persistence.RowStore.addRow(RowStore.java:97)
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.createRow(IgniteCacheOffheapManagerImpl.java:1302)
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$UpdateClosure.call(GridCacheMapEntry.java:4426)
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$UpdateClosure.call(GridCacheMapEntry.java:4371)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.invokeClosure(BPlusTree.java:3083)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.access$6200(BPlusTree.java:2977)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:1726)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:1703)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:1703)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:1610)
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1249)
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:352)
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.storeValue(GridCacheMapEntry.java:3602)
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.initialValue(GridCacheMapEntry.java:2774)
at org.apache.ignite.internal.processors.datastreamer.DataStreamerImpl$IsolatedUpdater.receive(DataStreamerImpl.java:2125)
at org.apache.ignite.internal.processors.datastreamer.DataStreamerUpdateJob.call(DataStreamerUpdateJob.java:140)
at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.localUpdate(DataStreamProcessor.java:400)
at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.processRequest(DataStreamProcessor.java:305)
at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.access$000(DataStreamProcessor.java:60)
at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor$1.onMessage(DataStreamProcessor.java:90)
at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1184)
at org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:125)
at org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManager.java:1091)
at org.apache.ignite.internal.util.StripedExecutor$Stripe.run(StripedExecutor.java:511)
at java.lang.Thread.run(Thread.java:748)
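I think the problem lies in starting an Ignite server before the Spark session, as the official Ignite examples do. That server starts caching the data I am writing to the Ignite cache and exceeds the maximum size of its default region (12 GB, which differs from the 20 GB I defined for my YARN cluster). However, I do not understand why the examples and documentation tell us to create an Ignite server before the Spark context (and, I assume, the session). I know that without it the application hangs once all Spark jobs have finished, but I do not understand the logic of having a server inside the Spark application that starts caching data. This concept confuses me, and for now I have set up this Ignite instance inside Spark as a client.
This is strange behavior, because all my Ignite nodes (running on YARN) have 20 GB defined for the default region (I changed it and verified it). This tells me the error must come from the Ignite servers started by Spark (I think there is one on the driver and one per worker), because I did not change the default region size in the Spark application's ignite-config.xml (the error shows the default of 12 GB). But does this make sense? Should Spark throw this error when its only purpose is to read data from and write data to Ignite? Does Spark take part in caching any data, and does this mean I should set client mode in my application's ignite-config.xml even though the official examples do not use client mode?
Best regards, Carlos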
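Answer 0 (score: 2)
First, the Spark-Ignite connector already connects in client mode.
I am going to assume you have enough memory, but you can make sure by following the example in the Capacity Planning guide.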
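As a rough illustration of the kind of arithmetic involved, here is a minimal sketch in Java. It assumes the in-memory size equals the raw input size and uses a 30% allowance for indexes and per-entry overhead; both are placeholder assumptions of mine, not measured values, and the class and method names are hypothetical.

```java
public class CapacityEstimate {
    /**
     * Rough per-node off-heap requirement in bytes.
     *
     * rawBytes    size of the primary data (assumed equal to the raw input size)
     * backups     number of backup copies kept for each partition
     * overheadPct extra percentage for indexes and per-entry overhead (an assumption)
     * nodes       number of server nodes the data is spread over
     */
    static long perNodeBytes(long rawBytes, int backups, int overheadPct, int nodes) {
        // Every backup is a full extra copy of the primary data.
        long withBackups = rawBytes * (1 + backups);
        // Add the assumed overhead on top of all copies.
        long withOverhead = withBackups * (100 + overheadPct) / 100;
        // Round up so the estimate never understates the requirement.
        return (withOverhead + nodes - 1) / nodes;
    }

    public static void main(String[] args) {
        long gib = 1L << 30;
        // Inputs from the question: 20 GiB of raw data, 1 backup, 6 nodes.
        long perNode = perNodeBytes(20 * gib, 1, 30, 6);
        System.out.printf("Estimated off-heap per node: %.1f GiB%n", (double) perNode / gib);
    }
}
```

With these placeholder inputs the estimate comes to roughly 8.7 GiB per node. The number to compare it against is the maxSize of the data region on each server node, so a real check should use measured row sizes rather than the raw file size.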
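However, I think the problem is that you are following the example application too closely (!). The example, in order to be fully self-contained, includes both a server and a Spark client. If you already have an Ignite cluster, you do not need to start a server in your Spark client.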
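One way to check whether the node that ran out of memory was using default settings: as far as I know, when maxSize is not set Ignite sizes the default data region at 20% of the host's physical RAM, with a 256 MiB initial size, which matches the initSize in the error. The sketch below does that arithmetic. The 64 GiB host size is a hypothetical input, and the class and method names are mine.

```java
public class DefaultRegionCheck {
    // Percentage of physical RAM assumed for the default region's maximum size.
    static final int DEFAULT_PERCENT = 20;

    /** Default region maximum size in MiB for a host with the given RAM in MiB. */
    static long defaultMaxMiB(long hostRamMiB) {
        return hostRamMiB * DEFAULT_PERCENT / 100;
    }

    /** True if the reported size is within toleranceMiB of the computed default. */
    static boolean looksLikeDefault(long reportedMaxMiB, long hostRamMiB, long toleranceMiB) {
        return Math.abs(reportedMaxMiB - defaultMaxMiB(hostRamMiB)) <= toleranceMiB;
    }

    public static void main(String[] args) {
        long reportedMaxMiB = (long) (12.6 * 1024); // maxSize taken from the error message
        long hostRamMiB = 64L * 1024;               // hypothetical host RAM, adjust to your machines
        System.out.println(looksLikeDefault(reportedMaxMiB, hostRamMiB, 512));
    }
}
```

If the check passes for the machines that run the Spark driver or executors, that suggests the exception came from a node started with an unmodified configuration rather than from the YARN-hosted servers configured with 20 GB.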
Here is a slightly altered example from a real application (in Java, sorry):
import org.apache.ignite.spark.IgniteDataFrameSettings;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

try (SparkSession spark = SparkSession
        .builder()
        .appName("AppName")
        .master(sparkMaster)
        .config("spark.executor.extraClassPath", igniteClassPath())
        .getOrCreate()) {
    // Get source DataFrame
    Dataset<Row> results = ....
    // Batch writes use mode(...) and save(); outputMode(...) belongs to streaming writers.
    results.write()
        .mode(SaveMode.Append)
        .format(IgniteDataFrameSettings.FORMAT_IGNITE())
        .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), igniteCfgFile)
        .option(IgniteDataFrameSettings.OPTION_TABLE(), "Results")
        .option(IgniteDataFrameSettings.OPTION_STREAMER_ALLOW_OVERWRITE(), true)
        .option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS(), "name")
        .option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PARAMETERS(), "backups=1")
        .save();
}
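I have not tested this, but you should get the idea: you need to provide a URL pointing to an Ignite configuration file, and behind the scenes the connector creates a client that connects to the servers described there.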