I'm completely new to Spark. I have a Spark DataFrame like this:
+-------------------+----------+-----+--------------------+---------+------+-------------------+-------------------+-------------------+
| time| hostname|group| mountpoint| inode| size| ctime| mtime| atime|
+-------------------+----------+-----+--------------------+---------+------+-------------------+-------------------+-------------------+
|2016-09-20 00:32:01|sysdev4500| scs|/zpool1/wain008/g...| 23| 92|2012-09-03 15:14:56|2012-04-10 19:08:05|2013-02-07 19:05:06|
|2016-09-20 00:32:01|sysdev4500| scs|/zpool1/wain008/g...| 22| 74|2012-09-03 15:14:56|2011-08-09 16:16:40|2013-02-07 19:05:06|
|2016-09-20 00:32:01|sysdev4500| scs|/zpool1/wain008/g...|189926604|167541|2012-09-19 05:47:48|2009-12-22 17:06:14|2013-03-11 20:35:19|
|2016-09-20 00:32:01|sysdev4500| scs|/zpool1/wain008/g...|189926608| 354|2012-09-19 05:47:49|2009-12-22 17:06:15|2013-03-11 20:35:23|
|2016-09-20 00:32:01|sysdev4500| scs|/zpool1/wain008/g...|189926601| 10580|2012-09-19 05:47:48|2009-12-22 17:06:14|2013-03-11 20:35:19|
+-------------------+----------+-----+--------------------+---------+------+-------------------+-------------------+-------------------+
only showing top 5 rows
I would like to plot a histogram of the distribution of atime. What is the most efficient way to do this in PySpark (Jupyter)? I will have a billion rows of data.
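One possible approach, sketched under assumptions rather than offered as the definitive answer: let Spark do the bucketing and counting, so that only one small row per bucket ever reaches the notebook, and plot that. The function names `atime_histogram` and `to_bar_series` are hypothetical, the monthly bucket is an arbitrary choice, and `date_trunc` assumes Spark 2.3 or later.

```python
def atime_histogram(df, unit="month"):
    """Count files per atime bucket on the cluster.

    Returns [(bucket_start, file_count), ...] sorted by bucket.
    `df` is assumed to be the DataFrame shown above.
    """
    # Imported here so the plotting helper below also works without Spark.
    from pyspark.sql import functions as F

    # Truncate each atime to the start of its bucket (month by default).
    bucket = F.date_trunc(unit, F.col("atime").cast("timestamp")).alias("bucket")
    rows = (
        df.where(F.col("atime").isNotNull())
        .groupBy(bucket)
        .count()
        .orderBy("bucket")
        .collect()  # one row per bucket, so this stays small
    )
    return [(row["bucket"], row["count"]) for row in rows]


def to_bar_series(pairs, fmt="%Y-%m"):
    """Turn [(datetime, count), ...] into parallel label/height lists."""
    ordered = sorted(pairs)
    labels = [bucket.strftime(fmt) for bucket, _ in ordered]
    heights = [count for _, count in ordered]
    return labels, heights
```

In the notebook this could be used as `labels, heights = to_bar_series(atime_histogram(df))` followed by `matplotlib.pyplot.bar(labels, heights)`. The point of the design is that the billion rows never leave the executors; only the aggregated buckets are collected.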
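I would also like a CDF in which the summed size of the files is plotted against the time each file was last accessed, to show how much data (in bytes) has gone unaccessed.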
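The cumulative curve could be built the same way, again as a sketch under assumptions: sum `size` per atime bucket in Spark, then take a running total over the handful of buckets on the driver. The names `bytes_by_atime` and `cumulative_bytes` are hypothetical, and the monthly bucket is only an example.

```python
from itertools import accumulate


def bytes_by_atime(df, unit="month"):
    """Sum file sizes per atime bucket on the cluster.

    Returns [(bucket_start, total_bytes), ...] sorted by bucket.
    `df` is assumed to be the DataFrame shown above.
    """
    # Imported here so the running-total helper below works without Spark.
    from pyspark.sql import functions as F

    bucket = F.date_trunc(unit, F.col("atime").cast("timestamp")).alias("bucket")
    rows = (
        df.where(F.col("atime").isNotNull())
        .groupBy(bucket)
        .agg(F.sum(F.col("size").cast("long")).alias("bytes"))
        .orderBy("bucket")
        .collect()  # one row per bucket
    )
    return [(row["bucket"], row["bytes"]) for row in rows]


def cumulative_bytes(pairs):
    """Running total of bytes, ordered by bucket.

    Each output value is the number of bytes whose last access falls in
    that bucket or an earlier one.
    """
    ordered = sorted(pairs)
    totals = accumulate(size for _, size in ordered)
    return [(bucket, total) for (bucket, _), total in zip(ordered, totals)]
```

Plotting the result of `cumulative_bytes(bytes_by_atime(df))` as a step or line chart gives, for each date, the bytes not accessed since that date or earlier. Dividing each value by the final total would turn the curve into a normalised CDF.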