Folks,

We need to standardize on processing our almost-entirely-SQL workloads with Spark 2.1. We are currently debating three options: RDDs, DataFrames, and SparkSQL. After a day of combing through Stack Overflow, papers, and the web, I put together the comparison below. I am seeking feedback on the table, particularly regarding performance and memory. Thanks in advance.
+---------------------------+------------------+-----------------+--------------------------------------+
| Feature                   | RDD              | DataFrame (DF)  | Spark SQL                            |
+---------------------------+------------------+-----------------+--------------------------------------+
| First-class Spark citizen | Yes              | Yes             | Yes                                  |
| Native? [4]               | Core abstraction | API             | Module                               |
| Generation [5]            | 1st              | 2nd             | 3rd                                  |
| Abstraction [4,5]         | Low-level API    | Data processing | SQL-based                            |
| ANSI-standard SQL         | None             | Some            | Near-ANSI [5]                        |
| Optimization              | None             | Catalyst [9]    | Catalyst [9]                         |
| Performance [3,4,8]       | Mixed views      | Mixed views     | Mixed views                          |
| Memory                    | ?                | ?               | ?                                    |
| Programming speed         | Slow             | Fast            | Faster if dealing with SQL workloads |
+---------------------------+------------------+-----------------+--------------------------------------+
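For concreteness, here is the same filter-and-count expressed in all three APIs, as a minimal Scala sketch (the sample data, column names, and the `people` view name are hypothetical, for illustration only):

```scala
import org.apache.spark.sql.SparkSession

object ThreeApis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("rdd-vs-df-vs-sql")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: (name, age) pairs.
    val df = Seq(("alice", 34), ("bob", 45), ("carol", 29)).toDF("name", "age")

    // 1) RDD: low-level functional API, bypasses the Catalyst optimizer.
    val rddCount = df.rdd.filter(row => row.getInt(1) > 30).count()

    // 2) DataFrame API: declarative, optimized by Catalyst.
    val dfCount = df.filter($"age" > 30).count()

    // 3) Spark SQL: same Catalyst plan, expressed as a SQL string.
    df.createOrReplaceTempView("people")
    val sqlCount = spark.sql("SELECT COUNT(*) FROM people WHERE age > 30")
      .first().getLong(0)

    println(s"rdd=$rddCount df=$dfCount sql=$sqlCount")
    spark.stop()
  }
}
```

Note that options 2 and 3 compile to the same Catalyst logical plan, so their performance should be essentially identical; the RDD version skips the optimizer entirely, which is one reason the "Performance" row draws mixed views.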
[3] Introducing DataFrames in Apache Spark for Large Scale Data Science, by Databricks
[4] Spark RDDs vs DataFrames vs SparkSQL, by Hortonworks
[5] A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets — When to use them and why, by Databricks
[6] Introducing Apache Spark 2.0, by Databricks
[7] Spark RDD vs Spark SQL Performance comparison using Spark Java APIs
[8] Spark SQL queries vs DataFrame functions, on Stack Overflow
[9] Spark SQL: Relational Data Processing in Spark, by Databricks, MIT, and UC Berkeley
Edit explaining how this question differs and is not a duplicate:

Thanks for pointing me to the sister question. While I see detailed discussion there and some overlap, I find minimal (none?):

(a) discussion of SparkSQL,
(b) comparison of memory consumption across the three approaches, and
(c) performance comparison on Spark 2.x (updated in my question). It does cite [4] (useful), but that is based on Spark 1.6.

I believe my revised question remains unanswered, so I request that the duplicate flag be removed.
Answer 0 (score: 0)

My personal opinion: