It could be a few different things:
- Serialization (e.g. to/from JSON) can take a while, especially if you are using the R or Python API: those run in a separate process, so every row has to travel back and forth between your process and the native JVM executors on the worker nodes and be serialized/deserialized along the way.
- If you performed any "wide transformations" such as an aggregation or a join before df.collect(), you most likely triggered a shuffle, which by default writes its output into 200 partitions (spark.sql.shuffle.partitions) on disk. When you then call collect(), Spark has to read that data back from disk, which is slower than reading it from RAM (see the sketch right after this list for one way to shrink that).
- Although your dataset is small, you may still need to increase the default executor RAM, executor cores (slots), and number of executors, and re-configure the number of partitions to get more parallelism (an example of setting these at session startup is shown further down, after the config checks).
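To make the first two points concrete, here is a minimal PySpark sketch. It assumes Spark 3.x, an existing SparkSession called spark, and a DataFrame df with a hypothetical some_key column; it lowers the shuffle-partition count for a small dataset, enables Arrow to cut the JVM-to-Python hand-off cost, and collapses the shuffle output before pulling it back to the driver:

# illustrative values only -- tune to your data size and cluster
spark.conf.set("spark.sql.shuffle.partitions", "8")  # avoid 200 tiny shuffle partitions on a small dataset

# Arrow speeds up the JVM <-> Python hand-off for toPandas()/createDataFrame()
# (a plain collect() still goes through the regular serializer)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

small_df = (
    df.groupBy("some_key")   # hypothetical column name
      .count()
      .coalesce(1)           # merge the shuffle output before retrieving it
)
rows = small_df.collect()    # or small_df.toPandas() to benefit from Arrow

Note that spark.sql.shuffle.partitions only affects shuffles planned after you change it, so set it before the aggregation or join runs.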
Check the number of partitions in your DataFrame:
df.rdd.getNumPartitions()
Check the shuffle partition setting:
spark.conf.get("spark.sql.shuffle.partitions")
Check the other configs such as executor RAM, cores, and instances:
spark.sparkContext.getConf().getAll()
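For the resource settings themselves, keep in mind that executor memory, cores, and instance counts generally have to be set when the application starts (via spark-submit or the SparkSession builder); changing them on an already-running session has no effect. A rough sketch with made-up values (including the app name) that you would tune to your own cluster:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("collect-tuning")                    # hypothetical app name
    .config("spark.executor.memory", "4g")        # RAM per executor
    .config("spark.executor.cores", "2")          # task slots per executor
    .config("spark.executor.instances", "4")      # ignored if dynamic allocation is on
    .config("spark.sql.shuffle.partitions", "8")  # fewer partitions for a small dataset
    .getOrCreate()
)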
Spark is a tough beast to tackle... it's best to visit the official documentation to learn more: https://spark.apache.org/