I'm trying to read some output from a SQL Server database into PySpark.
The data is in the following format:
{
"var1": 0,
"var2": 0,
"var3": 1,
"var4": null,
"var5": 22156,
"var6": 0,
"var7": 1,
"var8": "Denver",
"var9": "Colorado",
"var10": null,
"var11": "2019-02-03",
"var12": "5E98915C-FFE0-11E8-931E-0242AC11000A",
"var13": null,
"var14": null,
"var15": "USA",
"var16": "freight",
"var17": null,
"var18": null,
"var19": "denv",
"var20": "e385dbf6-8b27-4779-a566-f5b75feaf83e",
"var21": 1547467268304,
"var22": "3281523"
}
{
"var1": 0,
"var2": 0,
"var3": 1,
"var4": null,
"var5": 65618,
"var6": 0,
"var7": 1,
"var8": "Orlando",
"var9": "Florida",
"var10": null,
"var11": "2018-12-14",
"var12": "578F16D8-FFE0-11E8-931E-0242AC11000A",
"var13": null,
"var14": null,
"var15": "USA",
"var16": "air",
"var17": null,
"var18": null,
"var19": "orla",
"var20": "4a231f2c-fe1d-46e8-bc4a-d3333a081566",
"var21": 1547467268305,
"var22": "3281523"
}
Note: for brevity I've only included a couple of records here.
I then read it into PySpark with the following (simple) code:
import time

start_time = time.time()
df = spark.read.option("multiLine", "true").json("/Users/davidmoors/Documents/MSSQL_Output_Example")
print("--- JSON file read in: %s seconds ---" % (time.time() - start_time))
df.printSchema()
print(df.count())
df.show(5)
This gives me the following output:
--- JSON file read in: 1.6511101722717285 seconds ---
root
|-- var1: long (nullable = true)
|-- var10: string (nullable = true)
|-- var11: string (nullable = true)
|-- var12: string (nullable = true)
|-- var13: string (nullable = true)
|-- var14: string (nullable = true)
|-- var15: string (nullable = true)
|-- var16: string (nullable = true)
|-- var17: string (nullable = true)
|-- var18: string (nullable = true)
|-- var19: string (nullable = true)
|-- var2: long (nullable = true)
|-- var20: string (nullable = true)
|-- var21: long (nullable = true)
|-- var22: string (nullable = true)
|-- var3: long (nullable = true)
|-- var4: string (nullable = true)
|-- var5: long (nullable = true)
|-- var6: long (nullable = true)
|-- var7: long (nullable = true)
|-- var8: string (nullable = true)
|-- var9: string (nullable = true)
1
+----+-----+----------+--------------------+-----+-----+-----+-------+-----+-----+-----+----+--------------------+-------------+-------+----+----+-----+----+----+------+--------+
|var1|var10| var11| var12|var13|var14|var15| var16|var17|var18|var19|var2| var20| var21| var22|var3|var4| var5|var6|var7| var8| var9|
+----+-----+----------+--------------------+-----+-----+-----+-------+-----+-----+-----+----+--------------------+-------------+-------+----+----+-----+----+----+------+--------+
| 0| null|2019-02-03|5E98915C-FFE0-11E...| null| null| USA|freight| null| null| denv| 0|e385dbf6-8b27-477...|1547467268304|3281523| 1|null|22156| 0| 1|Denver|Colorado|
+----+-----+----------+--------------------+-----+-----+-----+-------+-----+-----+-----+----+--------------------+-------------+-------+----+----+-----+----+----+------+--------+
For some reason the code only reads the first record and ignores the second. My guess is that with multiLine=true Spark treats the whole file as a single JSON document and stops after the first top-level object, but I haven't been able to confirm this.
Can anyone suggest how I can read all of the data into PySpark? The real file is very large, with millions of records stored in this format.
Thanks.
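
In case it helps, here is the workaround I've been sketching, assuming the file is simply a sequence of pretty-printed JSON objects concatenated back to back (no enclosing array, nothing but whitespace between objects). The idea is to split the text into individual objects, collapse each one onto a single line, and then hand Spark ordinary JSON Lines. The regex split is my own guess at the structure and may well need adjusting:

import json
import re

# Read each input file as one big string (pairs of path -> contents),
# keeping only the contents.
raw = spark.sparkContext.wholeTextFiles(
    "/Users/davidmoors/Documents/MSSQL_Output_Example"
).values()

# Split on the boundary between one object's closing brace and the next
# object's opening brace, then restore the braces the split consumed.
objects = (
    raw.flatMap(lambda text: re.split(r"\}\s*\{", text))
       .map(lambda s: "{" + s.strip().strip("{}").strip() + "}")
)

# Re-serialise each object onto a single line and parse as JSON Lines
# (no multiLine option needed at this point).
json_lines = objects.map(lambda s: json.dumps(json.loads(s)))
df = spark.read.json(json_lines)

I'm not sure this is sensible at the scale I'm working with, though: wholeTextFiles pulls each file into memory as a single string, and the regex would break if any string value happened to contain a "}{" boundary. That's part of why I'm asking whether there's a better way.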