Plotting a histogram of date frequencies in a Spark DataFrame

Time: 2016-09-24 07:39:29

Tags: python date apache-spark pyspark histogram

I'm completely new to Spark. I have a Spark DataFrame that looks like this:

+-------------------+----------+-----+--------------------+---------+------+-------------------+-------------------+-------------------+
|               time|  hostname|group|          mountpoint|    inode|  size|              ctime|              mtime|              atime|
+-------------------+----------+-----+--------------------+---------+------+-------------------+-------------------+-------------------+
|2016-09-20 00:32:01|sysdev4500|  scs|/zpool1/wain008/g...|       23|    92|2012-09-03 15:14:56|2012-04-10 19:08:05|2013-02-07 19:05:06|
|2016-09-20 00:32:01|sysdev4500|  scs|/zpool1/wain008/g...|       22|    74|2012-09-03 15:14:56|2011-08-09 16:16:40|2013-02-07 19:05:06|
|2016-09-20 00:32:01|sysdev4500|  scs|/zpool1/wain008/g...|189926604|167541|2012-09-19 05:47:48|2009-12-22 17:06:14|2013-03-11 20:35:19|
|2016-09-20 00:32:01|sysdev4500|  scs|/zpool1/wain008/g...|189926608|   354|2012-09-19 05:47:49|2009-12-22 17:06:15|2013-03-11 20:35:23|
|2016-09-20 00:32:01|sysdev4500|  scs|/zpool1/wain008/g...|189926601| 10580|2012-09-19 05:47:48|2009-12-22 17:06:14|2013-03-11 20:35:19|
+-------------------+----------+-----+--------------------+---------+------+-------------------+-------------------+-------------------+
only showing top 5 rows

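I would like to plot a histogram of the distribution of atime. What is the most efficient way to do this in pyspark (Jupyter)? I will have about a billion rows of data.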
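One possible approach, offered as a sketch rather than a definitive answer: let Spark do the counting with a `groupBy` on the calendar day of `atime`, so that only one row per distinct day reaches the driver, and plot that small result with matplotlib. The function names below (`atime_counts_per_day`, `rebin_by_month`, `plot_counts`) are hypothetical, `df` is assumed to be the DataFrame shown above, `atime` is assumed to be a timestamp or a `yyyy-MM-dd HH:mm:ss` string, and matplotlib is assumed to be available in the Jupyter kernel.

```python
from collections import Counter
from datetime import date


def atime_counts_per_day(df):
    """Count rows per calendar day of `atime`; the aggregation runs in Spark."""
    # Imported inside the function so the helpers below work without Spark.
    from pyspark.sql import functions as F

    per_day = (
        df.where(F.col("atime").isNotNull())
          .groupBy(F.to_date(F.col("atime")).alias("day"))
          .count()
          .orderBy("day")
    )
    # One row per distinct day, so collecting to the driver is cheap.
    return [(row["day"], row["count"]) for row in per_day.collect()]


def rebin_by_month(counts):
    """Coarsen (day, count) pairs into (first-of-month, count) pairs."""
    monthly = Counter()
    for day, n in counts:
        monthly[date(day.year, day.month, 1)] += n
    return sorted(monthly.items())


def plot_counts(counts, bar_width_days=1.0):
    """Draw the pre-aggregated counts as a bar chart (assumes matplotlib)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar([d for d, _ in counts], [n for _, n in counts], width=bar_width_days)
    ax.set_xlabel("atime")
    ax.set_ylabel("number of files")
    fig.autofmt_xdate()
    return fig
```

In a notebook this might be used as `plot_counts(rebin_by_month(atime_counts_per_day(df)), bar_width_days=28)`. Because the bins are computed by the cluster, the cost on the driver depends on the number of distinct days, not on the number of rows.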

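I would also like a CDF in which the sum of the file sizes (size) is plotted, to show how much data (in bytes) has gone unaccessed as a function of when the files were last accessed.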
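A cumulative curve of this kind could be sketched in the same way: sum `size` per day of `atime` in Spark, then take a running total on the driver. The value at a given day is then the number of bytes whose last access was on or before that day. As before, this is only an illustration under assumptions: the function names are hypothetical, `df` is the DataFrame shown above, `size` is assumed to be in bytes, and matplotlib is assumed to be installed.

```python
from itertools import accumulate


def bytes_per_atime_day(df):
    """Sum `size` per calendar day of `atime`; the aggregation runs in Spark."""
    # Imported inside the function so the helpers below work without Spark.
    from pyspark.sql import functions as F

    per_day = (
        df.where(F.col("atime").isNotNull())
          .groupBy(F.to_date(F.col("atime")).alias("day"))
          .agg(F.sum(F.col("size").cast("long")).alias("bytes"))
          .orderBy("day")
    )
    return [(row["day"], row["bytes"]) for row in per_day.collect()]


def cumulative_bytes(per_day):
    """Running total of bytes last accessed on or before each day."""
    per_day = sorted(per_day)
    days = [d for d, _ in per_day]
    totals = accumulate(b for _, b in per_day)
    return list(zip(days, totals))


def plot_cumulative(cdf):
    """Draw the running total as a step curve (assumes matplotlib)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.step([d for d, _ in cdf], [b for _, b in cdf], where="post")
    ax.set_xlabel("atime (last access)")
    ax.set_ylabel("cumulative size (bytes)")
    fig.autofmt_xdate()
    return fig
```

In a notebook this might be called as `plot_cumulative(cumulative_bytes(bytes_per_atime_day(df)))`. Dividing each total by the last one would turn the curve into a fraction between 0 and 1 if a normalized CDF is preferred.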

0 answers:

No answers