I am working on a problem where we have a lot of different events coming from different sources, and about 60% of the fields are common across these events. I initially started by creating an individual table for each event, but since there can be many event types and roughly 60% of the data fields are shared among them, I am now thinking of creating one event table that has columns for all events, plus a type column that lets my Spark jobs pick only the events relevant to them. This table is a Hive external table, and Spark jobs will load data into it by processing a staging JSON table.
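To make the idea concrete, here is a minimal sketch of the single-table layout I have in mind, assuming hypothetical table, column, and path names (unified_events, event_type, payload_a, etc. are illustrative, not my actual schema):

```scala
// Hypothetical sketch of the single-table approach described above.
// Table, column, and path names are made up for illustration.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("event-consolidation")
  .enableHiveSupport()
  .getOrCreate()

// One external table with the union of all event columns plus a discriminator column.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS unified_events (
    event_id   STRING,
    event_ts   TIMESTAMP,
    source     STRING,
    payload_a  STRING,   -- column used only by some event types
    payload_b  STRING    -- column used only by other event types
  )
  PARTITIONED BY (event_type STRING)
  STORED AS PARQUET
  LOCATION '/data/unified_events'
""")

// A downstream Spark job reads only the slice it cares about.
val clicks = spark.table("unified_events")
  .where(col("event_type") === "click")
```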
I am seeking input from experts on whether this one-table design is feasible.
My cluster has 6 data nodes with 32 GB of RAM and 5 TB of disk space each. Since Spark is our core processing framework, I am worried about the resource consumption of all the jobs that will run, and about what happens if partitions become too big. I am also concerned about performance and speed.
Any inputs are appreciated.
Answer:
There are a few things to consider before deciding how to store the data.
How much does your data grow per day? This is a very important point: are you going to generate a lot of small files? If so, you should consider an intermediate process that rewrites the data into larger files, at least as big as the HDFS block size; given the size of your cluster, this matters. A simple Spark compaction pass is sketched below.
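A minimal sketch of such a compaction pass, assuming hypothetical input/output paths and an illustrative target file count (tune both to your block size and daily volume):

```scala
// Hypothetical compaction job: rewrite many small files into fewer, larger ones.
// Paths and the target of 8 output files are assumptions for illustration.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("compact-events")
  .enableHiveSupport()
  .getOrCreate()

val raw = spark.read.parquet("/data/unified_events/ingest_date=2018-01-01")

raw
  .coalesce(8)          // fewer, larger output files (aim for at least the HDFS block size)
  .write
  .mode("overwrite")
  .parquet("/data/unified_events_compacted/ingest_date=2018-01-01")
```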
Also be careful with how you partition the data: very fine-grained partitioning can end up producing lots of small files, which will hurt your performance. Do you really need partitions by both customer and event type? See the sketch after this paragraph for the difference in layout.
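As an illustration of the difference, assuming hypothetical table and partition column names (staging_events, customer_id, event_type, ingest_date):

```scala
// Hypothetical comparison of partition granularity; names and paths are made up.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-layout")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.table("staging_events")   // assumed staging table

// Fine-grained: customer x event type x date can explode into many tiny partitions/files.
df.write
  .partitionBy("customer_id", "event_type", "ingest_date")
  .parquet("/data/unified_events_fine")

// Coarser: partition only by date and let jobs filter on event_type at read time.
df.write
  .partitionBy("ingest_date")
  .parquet("/data/unified_events_coarse")
```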
I hope this helps you make some decisions.
Edit: answering some of the questions
Hive seems to have some limitations when you modify the column structure of a Parquet table. For example, to rename a column in the table definition you have to set the flag parquet.column.index.access for it to work, which means all of your data files must share the same schema. REPLACE COLUMNS in Hive, adding a completely new definition, did not work for me in Hive 1.3: for some reason I could not read the new columns, and I am not sure whether this is fixed in other versions.
Also, schema evolution (schema merging) is turned off by default in Spark because it is more expensive: basically Spark has to read all of the files and merge their schemas for it to work, and depending on the number of files this can affect performance.
http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging
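If you do end up with files written under different schema versions, Spark can still merge them at read time, at the cost described above. A minimal sketch (the path is an assumption):

```scala
// Schema merging must be requested explicitly because it is expensive:
// Spark reads the footers of all Parquet files and reconciles their schemas.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("merge-schema")
  .getOrCreate()

val merged = spark.read
  .option("mergeSchema", "true")      // off by default for Parquet
  .parquet("/data/unified_events")    // hypothetical path

merged.printSchema()
```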