Reading JSON output from a database into PySpark

Asked: 2019-02-04 16:36:23

Tags: python json apache-spark pyspark jsonlines

I am trying to read some output from a SQL Server database into PySpark.

The data is in the following format:

{
  "var1": 0,
  "var2": 0,
  "var3": 1,
  "var4": null,
  "var5": 22156,
  "var6": 0,
  "var7": 1,
  "var8": "Denver",
  "var9": "Colorado",
  "var10": null,
  "var11": "2019-02-03",
  "var12": "5E98915C-FFE0-11E8-931E-0242AC11000A",
  "var13": null,
  "var14": null,
  "var15": "USA",
  "var16": "freight",
  "var17": null,
  "var18": null,
  "var19": "denv",
  "var20": "e385dbf6-8b27-4779-a566-f5b75feaf83e",
  "var21": 1547467268304,
  "var22": "3281523"
}
{
  "var1": 0,
  "var2": 0,
  "var3": 1,
  "var4": null,
  "var5": 65618,
  "var6": 0,
  "var7": 1,
  "var8": "Orlando",
  "var9": "Florida",
  "var10": null,
  "var11": "2018-12-14",
  "var12": "578F16D8-FFE0-11E8-931E-0242AC11000A",
  "var13": null,
  "var14": null,
  "var15": "USA",
  "var16": "air",
  "var17": null,
  "var18": null,
  "var19": "orla",
  "var20": "4a231f2c-fe1d-46e8-bc4a-d3333a081566",
  "var21": 1547467268305,
  "var22": "3281523"
}

Note: I have only included a couple of records here for brevity.

I then read it into PySpark using the following (simple) code:

import time

start_time = time.time()

# multiLine=true tells Spark to parse each input file as a single JSON document
df = spark.read.option("multiLine", "true").json("/Users/davidmoors/Documents/MSSQL_Output_Example")
print("--- JSON file read in: %s seconds ---" % (time.time() - start_time))

df.printSchema()
print(df.count())
df.show(5)

I get the following output:

--- JSON file read in: 1.6511101722717285 seconds ---
root
 |-- var1: long (nullable = true)
 |-- var10: string (nullable = true)
 |-- var11: string (nullable = true)
 |-- var12: string (nullable = true)
 |-- var13: string (nullable = true)
 |-- var14: string (nullable = true)
 |-- var15: string (nullable = true)
 |-- var16: string (nullable = true)
 |-- var17: string (nullable = true)
 |-- var18: string (nullable = true)
 |-- var19: string (nullable = true)
 |-- var2: long (nullable = true)
 |-- var20: string (nullable = true)
 |-- var21: long (nullable = true)
 |-- var22: string (nullable = true)
 |-- var3: long (nullable = true)
 |-- var4: string (nullable = true)
 |-- var5: long (nullable = true)
 |-- var6: long (nullable = true)
 |-- var7: long (nullable = true)
 |-- var8: string (nullable = true)
 |-- var9: string (nullable = true)

1
+----+-----+----------+--------------------+-----+-----+-----+-------+-----+-----+-----+----+--------------------+-------------+-------+----+----+-----+----+----+------+--------+
|var1|var10|     var11|               var12|var13|var14|var15|  var16|var17|var18|var19|var2|               var20|        var21|  var22|var3|var4| var5|var6|var7|  var8|    var9|
+----+-----+----------+--------------------+-----+-----+-----+-------+-----+-----+-----+----+--------------------+-------------+-------+----+----+-----+----+----+------+--------+
|   0| null|2019-02-03|5E98915C-FFE0-11E...| null| null|  USA|freight| null| null| denv|   0|e385dbf6-8b27-477...|1547467268304|3281523|   1|null|22156|   0|   1|Denver|Colorado|
+----+-----+----------+--------------------+-----+-----+-----+-------+-----+-----+-----+----+--------------------+-------------+-------+----+----+-----+----+----+------+--------+

For some reason, my code reads only the first record and not the second. Why is that?
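This is actually the documented behaviour of the multiLine option rather than a bug: with multiLine set to true, Spark parses each input file as a single JSON document, so a file containing several concatenated top-level objects yields only the first one, which is why count() returns 1 above. A quick illustration of the two modes (the file names here are hypothetical):

# Default mode expects JSON Lines: one complete object per physical line.
df_lines = spark.read.json("records.jsonl")

# multiLine=true parses each whole file as one JSON document,
# e.g. a single object or a single top-level array of objects.
df_whole = spark.read.option("multiLine", "true").json("document.json")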

Can anyone suggest how I can read all of the data into PySpark? I have a very large file with millions of records stored in this format.
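For reference, here is a minimal pre-processing sketch (not from the original post; the output path is hypothetical) that rewrites the file as JSON Lines so the default reader picks up every record. It assumes the objects are separated exactly by a closing brace, a newline, and an opening brace, as in the sample above, and it reads the whole file into memory, so a file with millions of records would need a streaming variant:

import json

src = "/Users/davidmoors/Documents/MSSQL_Output_Example"        # original input
dst = "/Users/davidmoors/Documents/MSSQL_Output_Example.jsonl"  # hypothetical output

with open(src) as f:
    raw = f.read()

# Join the concatenated objects into one valid JSON array,
# then re-serialize each record onto its own line.
records = json.loads("[" + raw.replace("}\n{", "},{") + "]")
with open(dst, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# The default reader expects exactly this one-object-per-line layout.
df = spark.read.json(dst)

Alternatively, wrapping all the records in a single top-level JSON array would let the multiLine reader ingest the file as-is, producing one row per array element.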

Thanks.

0 Answers:

No answers yet.