Question

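I've been playing with the java-sizeof library (https://github.com/phatak-dev/java-sizeof) and using it to measure data set sizes in Apache Spark. As it turns out, Row objects are ridiculously large. Hugely large. Why is that?

Take a fairly simple schema: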

root
 |-- account: string (nullable = true)
 |-- date: long (nullable = true)
 |-- dialed: string (nullable = true)
 |-- duration: double (nullable = true)

The sample data looks like this:

+-------+-------------+----------+--------+
|account|         date|    dialed|duration|
+-------+-------------+----------+--------+
|   5497|1434620384003|9075112643|   790.0|
+-------+-------------+----------+--------+

Now we do this:

val row = df.take(1)(0)
// row: org.apache.spark.sql.Row = [5497,1434620384003,9075112643,790.0]

Now I use SizeEstimator:

SizeEstimator.estimate(row)
// res19: Long = 85050896

81 megabytes! For a single row! Thinking this must be some kind of mistake, I do:

SizeEstimator.estimate(df.take(100))
// res20: Long = 85072696

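Interestingly, it is not much bigger, only about 20 KB more, despite holding 100 times as much data. Above 100 rows, growth seems to become linear. For 1,000 rows it looks like this: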

SizeEstimator.estimate(df.take(1000))
// res21: Long = 850711696

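OK, so that is about 10 times bigger than for 100 rows, which is more or less linear. In further tests, the size keeps increasing linearly past 100 rows. Based on these tests, after about 100 rows, the cost per Row object is still over 800 KB!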
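The per-row figure can be checked with a little arithmetic on the three measurements above. This is only a sketch: the variable names are made up for illustration, and note that the first measurement is a single Row while the other two are arrays of rows, so the first difference also includes a small amount of array overhead.

```scala
// Sizes reported by SizeEstimator.estimate in the measurements above (bytes).
val size1    = 85050896L   // df.take(1)(0), a single Row
val size100  = 85072696L   // df.take(100), an Array of 100 rows
val size1000 = 850711696L  // df.take(1000), an Array of 1,000 rows

// Average extra bytes per row when going from 1 to 100 rows.
val perRowUpTo100 = (size100 - size1) / 99.0

// Average extra bytes per row when going from 100 to 1,000 rows.
val perRowPast100 = (size1000 - size100) / 900.0

println(s"1 -> 100 rows:    $perRowUpTo100 bytes per additional row")
println(s"100 -> 1000 rows: $perRowPast100 bytes per additional row")
```

The first ratio comes out at roughly 220 bytes per additional row, the second at about 850,710 bytes, which is where the "over 800 KB per row" figure comes from.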

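Out of curiosity, I tried a couple of different object types for the same underlying data. For example, here are the results for an Array of tuples instead of Row objects: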

SizeEstimator.estimate(
  df.map(r => (r.getString(0), r.getLong(1), r.getString(2), r.getDouble(3))).take(1)
)
// res22: Long = 216

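OK, that's a little better. Even better, for 10 rows it is only 1,976 bytes, and for 100 rows it is only 19,616 bytes. Definitely heading in the right direction.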

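Then I encoded the DataFrame as an RDD[Array[Byte]], where each Array[Byte] is a binary-encoded Avro record with the same schema as the underlying DataFrame. Then I do:

SizeEstimator.estimate(encodedRdd.take(1))
// res23: Long = 72

72 bytes, even better! And for 100 rows it is 5,216 bytes, about 52 bytes per row, and it keeps going down from there (48,656 bytes for 1,000 records).

So, at best, a Row object weighs about 850 KB per row, whereas a binary Avro record of the same data is about 50 bytes.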
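For a sense of scale, the field values of the sample row can be packed into a few dozen bytes by hand. The sketch below is not Avro; it uses java.io.DataOutputStream from the standard library as a stand-in binary encoding, and the variable names are illustrative only. SizeEstimator would report somewhat more for the resulting array, because its estimate also counts JVM object headers.

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}

// Field values of the sample row shown earlier.
val account  = "5497"
val date     = 1434620384003L
val dialed   = "9075112643"
val duration = 790.0

val buffer = new ByteArrayOutputStream()
val out    = new DataOutputStream(buffer)
out.writeUTF(account)     // 2-byte length prefix + 4 bytes of text
out.writeLong(date)       // 8 bytes
out.writeUTF(dialed)      // 2-byte length prefix + 10 bytes of text
out.writeDouble(duration) // 8 bytes
out.flush()

val payload: Array[Byte] = buffer.toByteArray
println(s"Raw payload: ${payload.length} bytes") // 34 bytes
```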

What is going on?

Answer 1

Actually, Row by itself is not that big. That is why you don't see a significant change in size when you take more rows. The problem seems to be the schema information.

When you collect data, you actually get GenericRowWithSchema:

val df = Seq((1, "foo"), (2, "bar")).toDF
df.first.getClass

// res12: Class[_ <: org.apache.spark.sql.Row] = 
//   class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema

GenericRowWithSchema carries schema information in its schema argument:

class GenericRowWithSchema(values: Array[Any], 
  override val schema: StructType)

Let's confirm that this is really the source of the problem:

import com.madhukaraphatak.sizeof.SizeEstimator
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema

val rowWithSchema = df.first 
val rowWithoutSchema = new GenericRowWithSchema(
  rowWithSchema.toSeq.toArray, null)

SizeEstimator.estimate(rowWithSchema)
// Long = 1444255708

SizeEstimator.estimate(rowWithoutSchema)
// Long = 120

Hypothesis: the estimated size you see includes the size of the schema:

SizeEstimator.estimate(df.schema)
// Long = 1444361928

This is roughly the same as for the collected rows. Let's create a new schema from scratch:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("_1", IntegerType, false),
  StructField("_2", StringType, true)))

val anotherRowWithSchema = new GenericRowWithSchema(
  Array(0, "foo"), schema)

SizeEstimator.estimate(anotherRowWithSchema)
// Long = 1444905324

So, as you can see, the results are consistent.

Why is the schema so large? Hard to say. When you look at the code, you will see that StructType is a complex class, even excluding its companion object, and not a simple schema definition.

That does not explain the reported size, though. I suspect it could be some fluke in SizeEstimator, but I am not sure yet.

You can isolate the problem further by estimating the size of a single StructField.
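A minimal sketch of that isolation step, assuming Spark SQL and java-sizeof are on the classpath (for example in spark-shell). The field name "foo" and the value names are arbitrary and purely illustrative. No expected output is shown, because the estimate depends on the JVM and library versions in use.

```scala
import com.madhukaraphatak.sizeof.SizeEstimator
import org.apache.spark.sql.types.{IntegerType, StructField}

// A single field definition, with no surrounding StructType.
val field = StructField("foo", IntegerType, nullable = true)

// Estimated size of everything reachable from this one field.
val fieldSize: Long = SizeEstimator.estimate(field)
println(s"Estimated size of one StructField: $fieldSize bytes")
```

If this number is already very large, the estimator is probably following references from the field into much bigger shared objects, which would point at the estimate and not at the row data itself.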
