Question

My data describes file sizes and the time it takes to process each file.

When I draw a scatter plot, I get the following result:

ggplot(data,aes(filesize,time))+geom_point()

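You can see that there are two lines in the plot.

How can I extract all the data points near those lines for further analysis?

Any suggestions on what to study are welcome. Thanks in advance.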

Answer 1

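A next step would be to identify the time-to-size ratios that occur more often than others, which makes it easier to isolate the observations that lie on the lines.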

library(dplyr)
library(ggplot2)

data %>%
  mutate(time_per_size = time / filesize) %>%  # column names as used in the question's plot call
  ggplot(aes(time_per_size)) +
    geom_histogram(bins = 50) # the default is 30 bins; adjust until the predominant ratios show up most cleanly
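
If reading the peaks off the histogram by eye is not practical, the most populated bins can also be picked out in code. The following is only a sketch on simulated data with two known slopes (3 and 1.5), written in base R so it runs without extra packages. The bin width, the cut-off at 10, and the choice of the two tallest bins are assumptions to tune for real data.

```r
# Simulated stand-in data: two lines (slopes 3 and 1.5) plus unrelated noise.
set.seed(1)
n <- 3000
x <- rbeta(n, shape1 = 1.8, shape2 = 10)
type <- sample(LETTERS[1:5], n, replace = TRUE)
y <- ifelse(type == "A", 3 * x,
     ifelse(type == "C", 1.5 * x, abs(rnorm(n, 0.5, 0.25))))

# Ratio per observation; drop extreme values so the bins stay narrow.
ratio <- y / x
ratio <- ratio[ratio < 10]

# Bin the ratios without plotting. The breaks are shifted by half a bin so
# that round values such as 1.5 and 3 fall in the middle of a bin.
h <- hist(ratio, breaks = seq(-0.025, 10.025, by = 0.05), plot = FALSE)

# Take the two most populated bins as candidate line slopes (assumed count).
top <- order(h$counts, decreasing = TRUE)[1:2]
peaks <- sort(h$mids[top])
print(peaks)
```

On real data the peaks will be broader than here, so neighbouring bins may compete for the top spots. Widening the bins or smoothing the counts first may help.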

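For example, using @PavoDive's sample data, we can apply this process to look at the ratios and use plotly to inspect the histogram interactively, which shows peaks at around 1.5 and 3.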

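library(ggplot2); library(dplyr)
# store the plot instead of relying on .Last.value, which is only set at the interactive top level
p <- dt %>%
  mutate(time_per_size = y/x) %>%
  filter(time_per_size < 10) %>%  # drop extreme ratios so the peaks stay visible
  ggplot(aes(time_per_size)) +
  geom_histogram(bins = 300)
plotly::ggplotly(p)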
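
Once the peak ratios are known, the observations near each line can be pulled out by keeping the points whose ratio lies within a tolerance of the nearest peak. This is a minimal base R sketch on simulated stand-in data. The peak values come from the histogram above, while `tol` is a hypothetical relative tolerance that would need adjusting for real, noisier data.

```r
# Simulated stand-in data: two lines (slopes 3 and 1.5) plus unrelated noise.
set.seed(1)
n <- 3000
x <- rbeta(n, shape1 = 1.8, shape2 = 10)
type <- sample(LETTERS[1:5], n, replace = TRUE)
y <- ifelse(type == "A", 3 * x,
     ifelse(type == "C", 1.5 * x, abs(rnorm(n, 0.5, 0.25))))
dt <- data.frame(x = x, y = y, type = type)

peaks <- c(1.5, 3)  # peak ratios read off the histogram
tol <- 0.05         # hypothetical relative tolerance around each peak

ratio <- dt$y / dt$x

# For every observation, find the peak whose ratio is closest.
dist_to_peaks <- abs(outer(ratio, peaks, "-"))
nearest <- peaks[apply(dist_to_peaks, 1, which.min)]

# Label a point with its line only if it is within the tolerance.
near_line <- abs(ratio - nearest) / nearest <= tol
dt$line <- ifelse(near_line, nearest, NA)

# Observations close to either line, ready for further analysis.
on_lines <- dt[!is.na(dt$line), ]
print(table(on_lines$line))
```

A few noise points will fall inside the tolerance band by chance. A tighter `tol` reduces that at the cost of dropping genuine points when the lines themselves are noisy.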

Answer 2

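I agree with @heds1 that there is probably some underlying relationship between your results and [at least] a third variable, whether you are aware of it or not.

See the following example with dummy data: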

library(data.table)
library(ggplot2)

# try to mimic your data in the x axis. Include some random types
set.seed(1)
dt <- data.table(x = rbeta(3000, shape1 = 1.8, shape2 = 10), type = sample(LETTERS[1:5], 3000, TRUE))

# introduce a couple lines:
dt[type == "A", y := 3*x]
dt[type == "C", y := 1.5*x]

# and add some "white noise":
dt[!type %chin% c("A", "C"), y := abs(rnorm(.N, .5, .25))]

# see what you have:
plot(dt$x, dt$y)

# now see the light:
ggplot(dt, aes(x, y, colour = type))+geom_point()
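
If a candidate third variable is available, one way to check whether it explains the lines is to summarise the y/x ratio per group: a group that sits on a line through the origin has an almost constant ratio. The sketch below uses base R on simulated stand-in data. The 0.01 threshold on the standard deviation is an assumption chosen for this noiseless example, not a general rule.

```r
# Simulated stand-in data: types A and C lie on lines, the rest is noise.
set.seed(1)
n <- 3000
x <- rbeta(n, shape1 = 1.8, shape2 = 10)
type <- sample(LETTERS[1:5], n, replace = TRUE)
y <- ifelse(type == "A", 3 * x,
     ifelse(type == "C", 1.5 * x, abs(rnorm(n, 0.5, 0.25))))
dt <- data.frame(x = x, y = y, type = type)

ratio <- dt$y / dt$x

# Per-type centre and spread of the ratio.
summary_by_type <- data.frame(
  type = sort(unique(dt$type)),
  median_ratio = as.numeric(tapply(ratio, dt$type, median)),
  sd_ratio = as.numeric(tapply(ratio, dt$type, sd))
)

# Flag types whose ratio barely varies (assumed threshold).
summary_by_type$on_line <- summary_by_type$sd_ratio < 0.01
print(summary_by_type)
```

The flagged types can then be used directly to subset the data, for example `dt[dt$type %in% c("A", "C"), ]`.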

I found unexpected lines in a scatter plot; how can I extract all the data near the lines for further analysis?
