Question

The Apache Spark Dataset API has two methods, head(n: Int) and take(n: Int).

The Dataset.scala source contains:

def take(n: Int): Array[T] = head(n)

I can't find any difference in the code these two functions execute. Why does the API have two different methods that produce the same result?

Answer 1

I tried it and found that head(n) and take(n) give exactly the same output. Both return the result as a list of Row objects.

DF.head(2)

[Row(Transaction_date=u'1/2/2009 6:17', Product=u'Product1', Price=u'1200', Payment_Type=u'Mastercard', Name=u'carolina', City=u'Basildon', State=u'England', Country=u'United Kingdom'), Row(Transaction_date=u'1/2/2009 4:53', Product=u'Product2', Price=u'1200', Payment_Type=u'Visa', Name=u'Betina', City=u'Parkville', State=u'MO', Country=u'United States')]

DF.take(2)

[Row(Transaction_date=u'1/2/2009 6:17', Product=u'Product1', Price=u'1200', Payment_Type=u'Mastercard', Name=u'carolina', City=u'Basildon', State=u'England', Country=u'United Kingdom'), Row(Transaction_date=u'1/2/2009 4:53', Product=u'Product2', Price=u'1200', Payment_Type=u'Visa', Name=u'Betina', City=u'Parkville', State=u'MO', Country=u'United States')]

Answer 2

  package org.apache.spark.sql
  /* ... */

  def take(n: Int): Array[T] = head(n)
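
The one-line definition quoted here is plain delegation: take just forwards its argument to head. As a rough illustration of that pattern only, here is a minimal sketch in plain Python. MiniDataset is a hypothetical toy class invented for this example; it is not part of Spark and says nothing about how Spark computes head internally.

```python
from typing import Generic, List, Sequence, TypeVar

T = TypeVar("T")


class MiniDataset(Generic[T]):
    """Hypothetical stand-in for a Dataset; NOT Spark's implementation."""

    def __init__(self, rows: Sequence[T]) -> None:
        self._rows = list(rows)

    def head(self, n: int) -> List[T]:
        # Return the first n rows (fewer if the dataset is shorter).
        # max(n, 0) keeps a negative n from slicing from the end.
        return self._rows[:max(n, 0)]

    def take(self, n: int) -> List[T]:
        # Alias: forwards to head, mirroring `def take(n: Int) = head(n)`.
        return self.head(n)


ds = MiniDataset(["a", "b", "c"])
same = ds.head(2) == ds.take(2)
print(same)
```

Because take has no logic of its own, any change to head is automatically picked up by take, so the two names cannot drift apart.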

Answer 3

The reason, in my opinion, is that the Apache Spark Dataset API tries to mimic the Pandas DataFrame API, which includes head: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html

Answer 4

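I think this is because the Spark developers tend to give it a rich API. There are also two other methods, where and filter, that do exactly the same thing.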
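
To make the where/filter point concrete, here is one common way a library can expose two names for a single operation: bind the second name to the same function. This is a sketch in plain Python with a hypothetical MiniFrame class made up for illustration; it is not Spark's code and is not meant to describe how Spark defines these methods.

```python
from typing import Callable, Iterable, List


class MiniFrame:
    """Hypothetical toy container; NOT Spark's DataFrame."""

    def __init__(self, rows: Iterable[int]) -> None:
        self._rows = list(rows)

    def filter(self, predicate: Callable[[int], bool]) -> "MiniFrame":
        # Keep only the rows for which the predicate is true.
        return MiniFrame([r for r in self._rows if predicate(r)])

    # Second name bound to the very same function object (SQL-style spelling).
    where = filter

    def rows(self) -> List[int]:
        # Return a copy so callers cannot mutate internal state.
        return list(self._rows)


def is_even(x: int) -> bool:
    return x % 2 == 0


frame = MiniFrame([1, 2, 3, 4])
print(frame.filter(is_even).rows() == frame.where(is_even).rows())
```

With this design the alias costs nothing to maintain: callers who think in SQL terms can write where, others can write filter, and both hit the same code path.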

Apache Spark Dataset API: head(n: Int) vs take(n: Int)
