Plotting a sample of a time series

Question

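I have a dataset with observations made every second for four consecutive days (about 340'000 data points). This is too much to display in a scatterplot. I would like to plot only a uniform sample of 2000 points in time.

Is it possible to achieve this with ggplot2's "Grammar of Graphics" approach? I haven't found any built-in "sampling" modifier, but perhaps it is easy to write one?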

library(ggplot2)

x <- 1:100000
d <- data.frame(x=x, y=rnorm(length(x)))
ggplot(d[sample(x, 2000), ], aes(x=x, y=y)) + geom_point()

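This is the "hackish" way of doing it, by modifying the data passed to ggplot. But I would rather not modify the data, just filter it so that the plot contains only a sample.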

ggplot(d, aes(x=x, y=y)) + ??? + geom_point()

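Edit: I am specifically looking for sampling, not smoothing or binning. My data shows the time it takes to simulate one second of a particular process. The simulation has been parallelized, and for every simulated second I have the run time of each of the cores involved (8 in total). I want to show suboptimal load balancing by plotting just the raw data points. The only reason for sampling is that 300'000 data points are too many for a scatterplot: plotting takes too long and the visualization is poor.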

Answer 1

If you want to create a scatterplot for big data, here are a few ggplot2 options.

They come from this course by Hadley.

library(ggplot2)
library(knitr)

# knitr options used to render the figures in this answer
opts_chunk$set(fig.width = 5, fig.height = 5, dev = "png")
render_markdown(strict = TRUE)


# some autocorrelated data
set.seed(1)
x <- 1:1e+05
d <- data.frame(x = x)
# with one order of differencing, arima.sim returns n + 1 values, i.e. 1e5 points here
d$y <- as.numeric(arima.sim(list(order = c(1, 1, 0), ar = 0.9), n = 1e+05 - 1))
# the basic plot
base_plot <- ggplot(d, aes(x = x, y = y))

geom_bin2d

You can set the binwidth for the x and y variables.

base_plot + geom_bin2d(binwidth = c(200, 5))


geom_hex

You can set the number of bins.

base_plot + geom_hex(bins = 200)


Small points

To reduce overplotting, draw each observation as a single-pixel point:

base_plot + geom_point(shape = ".")

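A related option, not mentioned in the course material as far as I can tell, is to combine small points with transparency so that dense regions appear darker. The sketch below uses its own toy data (`d_demo` is just an illustrative name), and the `alpha` value of 0.2 is an arbitrary starting point that you would probably tune by eye.

```r
library(ggplot2)

# Toy data for illustration only (d_demo is a made-up name)
set.seed(1)
d_demo <- data.frame(x = 1:100000, y = rnorm(100000))

# Single-pixel points plus transparency: overlapping points
# accumulate, so dense regions render darker than sparse ones.
p_alpha <- ggplot(d_demo, aes(x = x, y = y)) +
  geom_point(shape = ".", alpha = 0.2)

p_alpha
```

Note that this still draws every point, so it helps with readability more than with rendering time.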

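Using a smoother

This relies on the smoothing method being able to give you the detail you want without crashing or taking too long. In this case the number of knots was chosen by trial and error (you may well want more detail).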

library(mgcv)
base_plot + stat_smooth(method = "gam", formula = y ~ s(x, k = 50))

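To make the "trial and error" part a little more concrete, here is a minimal sketch of one way to compare a few values of `k`. It uses a shorter series so it runs quickly; the candidate values in `ks` and the use of the residual standard deviation as a rough measure of captured detail are my own assumptions, not part of the original answer.

```r
library(mgcv)

# A shorter autocorrelated series, generated the same way as above
set.seed(1)
n <- 5000
d_small <- data.frame(
  x = 1:n,
  y = as.numeric(arima.sim(list(order = c(1, 1, 0), ar = 0.9), n = n - 1))
)

# Candidate basis dimensions to try (arbitrary choices)
ks <- c(10, 25, 50)

# Fit one GAM per candidate k
fits <- lapply(ks, function(k) gam(y ~ s(x, k = k), data = d_small))

# Residual spread as a rough indicator of how much detail each fit captures
resid_sd <- sapply(fits, function(f) sd(residuals(f)))

data.frame(k = ks, resid_sd = resid_sd)
```

A smaller residual spread means the smoother follows the data more closely; whether that extra detail is worth the longer fitting time is a judgment call for your data.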

Answer 2

You can subset the data in the geom_point call using the data argument:

... + geom_point(data=d[sample(x,2000),])

That way you are free to add other geoms that use all of the data. For example, with the sample data:

ggplot(d, aes(x=x, y=y)) + geom_hex() + geom_point(data=d[sample(x,2000),])

(Figure: hexbin and sampled points)
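If you would rather not repeat the data frame name inside the layer, reasonably recent versions of ggplot2 also accept a function for a layer's data argument; it is called with the plot's default data when the plot is built. The sketch below is one way to get something close to a "sampling modifier". Note that `sample_rows` is a hypothetical helper I made up for illustration, not a ggplot2 function.

```r
library(ggplot2)

set.seed(1)
d <- data.frame(x = 1:100000, y = rnorm(100000))

# Hypothetical helper (not part of ggplot2): returns a function that
# keeps n randomly chosen rows of whatever data frame it is given.
sample_rows <- function(n) {
  function(df) df[sample(nrow(df), min(n, nrow(df))), , drop = FALSE]
}

# The layer-level `data` argument accepts a function, which ggplot2
# calls with the plot's default data when the plot is built.
p <- ggplot(d, aes(x = x, y = y)) +
  geom_point(data = sample_rows(2000))

# Number of points actually drawn by the sampled layer
n_drawn <- nrow(layer_data(p, 1))
```

Because the function runs at build time, a fresh sample is drawn each time the plot is rendered unless you set the seed beforehand.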
