Question

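Input:

A PySpark DataFrame is read from a JSON file (the output of a previous ETL job) that has a complex data structure with many nested fields. The file contains more than 100,000 records.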

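I need to keep only the records whose field 'name.en' is not null. Beforehand, I manually checked the contents of the input JSON file: the field 'name.en' is empty for 5% of all records, so I expect to see about 95,000 records in the output.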
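A manual check like this can be reproduced outside Spark with a short script. This is a minimal sketch, assuming the file is in JSON Lines form (one object per line, which is what Spark's JSON reader expects by default) and that "empty" means the value is null or the key is missing. The helper name `count_non_null` is hypothetical.

```python
import json


def count_non_null(path, outer="name", inner="en"):
    """Return (total, non_null) record counts for a JSON Lines file.

    Assumption: one JSON object per line; blank lines are ignored.
    """
    total = 0
    non_null = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            total += 1
            outer_value = record.get(outer)
            # Mirror the Spark filter: both `name` and `name.en` must be non-null.
            if isinstance(outer_value, dict) and outer_value.get(inner) is not None:
                non_null += 1
    return total, non_null
```

Calling `count_non_null('/data/myfile.json')` returns the total number of records and the number with a non-null 'name.en', which gives a Spark-independent baseline to compare against both `count()` results.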

Problem:

When Spark reads the data and persists it with MEMORY_ONLY (or MEMORY_AND_DISK), most of the records disappear.

Case 1: with persist (does not work)

    import pyspark

    df = sparkSession.read.format('json').schema(REQUIRED_SCHEMA).load(JSON_FILE_PATH)
    df.persist(pyspark.StorageLevel.MEMORY_ONLY)
    print(df.filter('name.en IS NOT NULL').count())  # returns 115 records instead of 95,000+

Explain output for df.filter('name.en is not NULL') with persist():

== Physical Plan ==
Filter (isnotnull(name#2) && isnotnull(name#2.en))
+- InMemoryTableScan [_id#0, field1#1, name#2, ..., field31#31], [isnotnull(name#2), isnotnull(name#2.en)]
     +- InMemoryRelation [_id#0, field1#1, name#2, ..., field31#31], true, 10000, StorageLevel(memory, 1 replicas)
           +- FileScan json [_id#0,field#1,name#2,..., field31#31] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/data/myfile.json], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_id:struct<oid:string>,field1:string,name:struct<en:string>,field3:struct<oi...

But when I remove the persist() call, it returns the correct number of records.

Case 2: without persist (works as expected)

    df = sparkSession.read.format('json').schema(REQUIRED_SCHEMA).load(JSON_FILE_PATH)
    print(df.filter('name.en IS NOT NULL').count()) #returns 95,000+ records as expected

Explain output for df.filter('name.en is not NULL') without persist():

== Physical Plan ==
Project [_id#0, field1#1, name#2, ... , field31#31]
+- Filter (isnotnull(name#2) && isnotnull(name#2.en))
  +- FileScan json [_id#0,field1#1,name#2,...,field31#31] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/data/myfile.json], PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: struct<_id:struct<oid:string>,field1:string,name:struct<en:string>,field3:struct<oi...

So the only difference is whether persist() is called. What could cause such a strange result?

Records of a complex PySpark DataFrame disappear after persist and filter

0 answers: