Question

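Here is my code and the resulting plot. Because of a few outliers, the x-axis is very long. Is there a simple way in R to restrict df$foo to the 0-90% or 0-95% percentile range, so that I plot only the typical values? Thanks.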

df <- read.csv('~/Downloads/foo.tsv', sep='\t', header=FALSE, stringsAsFactors=FALSE)
names(df) <- c('a', 'foo', 'goo')
df$foo <- as.numeric(df$foo)  # non-numeric entries become NA
goodValue <- df$foo
summary(goodValue)
hist(goodValue, main="Distribution", xlab="foo", breaks=20)

Answer 1

Maybe this is what you are looking for?

a <- c(rnorm(99), 50)  # create some data with one outlier
quant <- as.numeric(quantile(a, c(0, 0.9)))  # get the 0 and 0.9 quantiles
hist(a[a >= quant[1] & a <= quant[2]])  # histogram of the data within these bounds (inclusive, so the minimum is kept)
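The same idea can be applied directly to a data frame column like df$foo from the question. This is only a minimal sketch: the simulated df is stand-in data (the real file is not available), the names upper and trimmed are made up for illustration, and 0.95 can be swapped for 0.9. It assumes df$foo is numeric and may contain NA values left over from as.numeric().

```r
set.seed(1)  # reproducible stand-in data; replace with the real df
df <- data.frame(foo = c(rexp(995, rate = 1 / 10), 5000, 8000, NA, 12000, 20000))

# Upper cutoff: the 95th percentile, ignoring missing values
upper <- quantile(df$foo, probs = 0.95, na.rm = TRUE)

# Keep rows at or below the cutoff; drop NAs explicitly
trimmed <- df[!is.na(df$foo) & df$foo <= upper, , drop = FALSE]

# Same histogram call as in the question, now without the extreme values
hist(trimmed$foo, main = "Distribution (0-95%)", xlab = "foo", breaks = 20)
```

Only the upper tail is cut here because the question is about large outliers; a lower bound can be added in the same way if needed.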

Answer 2

Suppose you want to examine the diamonds dataset (I don't have your data).

library(ggplot2)
library(dplyr)
diamonds %>% ggplot() + geom_histogram(aes(x = price))

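You may decide to examine the deciles of your data and, since the tail is of no interest to you, discard the uppermost decile. You can do that as follows, using a free x scale so that you can see what is happening within each decile.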

diamonds %>% mutate(ntile = ntile(price, 10)) %>% 
  filter(ntile < 10) %>%
  ggplot() + geom_histogram(aes(x = price)) + 
  facet_wrap(~ntile, scales = "free_x")

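But be careful: although viewing your data at a finer granularity has its benefits, notice that you can then hardly tell any more that your data is roughly exponentially distributed (with a heavy tail, as commodity price data often is).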
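Deciles cannot express a 95% cutoff directly. One possible alternative, sketched here on the same diamonds data, is to compute the cutoff with quantile() inside filter() and draw a single histogram. The 0.95 threshold, the bins value, and the names trimmed and p are arbitrary choices for illustration.

```r
library(ggplot2)
library(dplyr)

# Keep only prices at or below the 95th percentile
trimmed <- diamonds %>%
  filter(price <= quantile(price, 0.95))

# Single histogram of the trimmed data
p <- ggplot(trimmed) + geom_histogram(aes(x = price), bins = 30)
print(p)
```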

Show only the 0-90% or 0-95% percentile range