Converting a DataFrame to JSON takes a lot of time

Time: 2019-04-16 16:39:01

Tags: json pyspark

I have a DataFrame with 10,000 records that I want to convert to JSON and send back to a web service, but df.toJSON().collect() takes a long time (~10 seconds). Can anyone suggest a way to reduce this time?

df.toJSON().collect()
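
A minimal, self-contained version of this pattern (with a hypothetical stand-in for the real 10,000-record DataFrame) is:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-to-json").getOrCreate()

# Stand-in for the real 10,000-record DataFrame
df = spark.range(10000).selectExpr("id", "concat('name_', id) AS name")

# toJSON() yields an RDD of JSON strings; collect() pulls them all back to the driver
json_records = df.toJSON().collect()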

1 Answer:

Answer 0 (score: 0)

It could be a few different things ...

  1. JSON serialization can take a while, especially with the R or Python APIs, because they run in a separate process and data has to move back and forth between that process and the JVM executors on the worker nodes to be serialized/deserialized.
  2. If you performed any "wide transformations" such as an aggregation or join before df.collect(), you most likely triggered a shuffle, which by default writes 200 partitions to disk; when you then call collect(), that data has to be read back from disk, which is slower than reading it from RAM.
  3. Although your dataset is small, you may need to increase the default executor RAM, executor cores (slots), and number of executors, and reconfigure the number of partitions to get more parallelism (see the sketch after this list).
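
For point 3, one way to apply these settings is when building the SparkSession (or equivalently via spark-submit flags). The values below are illustrative assumptions for a small job, not tuned recommendations:

from pyspark.sql import SparkSession

# Illustrative values only; executor settings must be in place before the
# SparkContext starts (set them here or via spark-submit flags).
spark = (
    SparkSession.builder
    .appName("df-to-json-tuned")
    .config("spark.executor.memory", "4g")         # executor RAM
    .config("spark.executor.cores", "2")           # slots per executor
    .config("spark.executor.instances", "4")       # number of executors
    .config("spark.sql.shuffle.partitions", "8")   # well below the 200 default
    .getOrCreate()
)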

Check the number of partitions:

df.rdd.getNumPartitions()
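
If that number is much larger than 10,000 rows warrant (for example, 200 after a shuffle), coalescing before collecting can cut task overhead. The target of 8 below is an assumption, not a tuned value:

print(df.rdd.getNumPartitions())                    # e.g. 200 after a shuffle
json_records = df.coalesce(8).toJSON().collect()    # 8 partitions is illustrative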

Check the shuffle partitions setting:

spark.conf.get("spark.sql.shuffle.partitions")
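If this still returns the 200 default, it can be lowered on a running session before any wide transformation; 8 is again just an illustrative figure for a 10,000-row job:

spark.conf.set("spark.sql.shuffle.partitions", "8")   # illustrative value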

Check other configs like executor RAM, cores, and instances:

spark.sparkContext.getConf().getAll()
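
getAll() returns every property as (key, value) pairs, so a quick sketch like the following pulls out just the executor-related settings:

# Print only the executor-related properties from the full config dump
for key, value in spark.sparkContext.getConf().getAll():
    if key.startswith("spark.executor."):
        print(key, "=", value)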

Spark is a tough beast to tackle ... it's best to visit the official documentation to learn more! https://spark.apache.org/