OK, so I'm really stuck. I have a dataset that looks like this:
                  Species Latitude Longitude            Oiling Condition BirdCount      Date_ Oil_Cond       Date week.number
 1        Northern Gannet 30.32860 -89.19810 Not Visibly Oiled      Live         1 2010-07-21        1 2010-07-21          30
 2          Laughing Gull 30.23172 -88.32127 Not Visibly Oiled      Live         1 2010-05-05        1 2010-05-05          19
 3        Northern Gannet 30.26677 -87.59248     Visibly Oiled      Live         1 2010-05-05        2 2010-05-05          19
 4 American White Pelican 29.29649 -89.66432 Not Visibly Oiled      Live         1 2010-05-05        1 2010-05-05          19
 5          Brown Pelican 29.88244 -88.87624     Visibly Oiled      Live         1 2010-05-08        2 2010-05-08          19
 6          Brown Pelican 29.00290 -89.36961 Not Visibly Oiled      Live         1 2010-05-14        1 2010-05-14          20
 7        Northern Gannet 30.33390 -85.56565           Unknown      Live         1 2010-05-17        6 2010-05-17          21
 8            Common Loon 30.28177 -87.51028 Not Visibly Oiled      Live         1 2010-05-17        1 2010-05-17          21
 9          Brown Pelican 30.41410 -88.24542     Visibly Oiled      Live         1 2010-05-18        2 2010-05-18          21
10        Northern Gannet 30.24063 -88.12451 Not Visibly Oiled      Live         1 2010-05-18        1 2010-05-18          21
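I'm trying to get a faceted histogram of the variable Oil_Cond for the 5 most frequently occurring bird species (there are over 100 unique species).
At first I thought I'd make one facet per species, using the following code: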
qplot(Oil_Cond, data = birds, facets = Species ~., geom = "histogram")
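But of course that is overloaded and won't work, because there would be over 100 facets. So I decided that I really only care about the top 5 species, and I worked out which they are and how often they occur (Laughing Gull: 3036, Brown Pelican: 789, Northern Gannet: 546, Royal Tern: 321, Black Skimmer: 258). However, I don't know how to restrict the plot to just those species.
Any help is greatly appreciated.
Thanks :)
Amy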
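Answer 0 (score: 3)
Probably the simplest approach here is to plot a subset of the data. The only thing to watch out for is whether the Species variable is stored as a factor rather than as character strings. First create the subset: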
birdsSub <- subset(birds, Species %in% c('Laughing Gull','Brown Pelican',
'Northern Gannet','Royal Tern','Black Skimmer'))
birdsSub$Species <- droplevels(birdsSub$Species)
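You should then be able to pass this data frame to qplot just as before. The reason for droplevels is that, if Species is stored as a factor, every species that no longer appears in the subset still exists as an unused factor level, and you would end up with all 100+ panels, all of them empty except five. If Species is stored as a character vector instead, skip the droplevels line: it is unnecessary there, and droplevels has no method for character vectors.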
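If you would rather not type the species names by hand, the top species can be derived from the data. The following is a minimal sketch in base R, not part of the original answer: the toy `birds` data frame and the `n_top` and `topNames` variables are made up for illustration, and only the column names `Species` and `Oil_Cond` come from the question. It also shows the effect of droplevels on the number of factor levels.

```r
# Toy stand-in for the real 'birds' data frame (hypothetical values).
birds <- data.frame(
  Species  = factor(c("Laughing Gull", "Laughing Gull", "Laughing Gull",
                      "Brown Pelican", "Brown Pelican", "Common Loon")),
  Oil_Cond = c(1, 2, 1, 2, 1, 6)
)

n_top <- 2  # assumption for the toy data; use 5 for the real data set

# Count rows per species, sort in decreasing order, keep the first n_top names.
counts   <- sort(table(birds$Species), decreasing = TRUE)
topNames <- head(names(counts), n_top)

# Keep only the rows belonging to the most common species.
birdsSub <- subset(birds, Species %in% topNames)

# The unused level ("Common Loon") is still attached to the factor here.
levelsBefore <- nlevels(birdsSub$Species)

# Drop unused levels so that facetting only creates panels for retained species.
birdsSub$Species <- droplevels(birdsSub$Species)
levelsAfter <- nlevels(birdsSub$Species)
```

Comparing `levelsBefore` with `levelsAfter` makes the point of the droplevels step visible: the subset keeps every original level until they are dropped explicitly.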
Answer 1 (score: 1)
You could use the excellent plyr package to solve this problem...
# If you don't already have plyr installed, uncomment the next line:
# install.packages('plyr')
require(plyr)
# First, find out how many of each species you have...
ns=ddply(birds,.(Species),summarise,n=length(Species))
# This will produce a table listing the number of each species you have
# (in the column 'n'). Type 'ns' to see the table.
# We can then rank the species occurrence, to see how important the different
# species are
ns$r = rank(-ns$n) # negative because 'rank' starts with the lowest number.
# have a look at the top 5 species:
subset(ns,r<=5)
# There are a couple of ways to proceed from here. Either we could get the
# top 5 species names from this 'ns' table:
# names=as.character(subset(ns,r<=5)$Species)
# and use joran's method, or we could merge the ns table and the original
# dataset (so that each species has an 'n' and 'r' attribute) and subset the
# data by species number or rank. I prefer the latter, as it allows you to
# flexibly change the species number threshold. i.e.:
birds=merge(birds,ns,by='Species')
# We've now added 'n' and 'r' columns to the birds data, so we can select
# our subset based on either of these columns:
birds.by.r=subset(birds,r<=5) # selects only the top 5 bird species
birds.by.n=subset(birds,n>100) # selects all species with over 100 occurrences
# Then just plot away!
qplot(Oil_Cond,data=birds.by.r,facets=Species~.,geom='histogram')
# or
qplot(Oil_Cond,data=birds.by.n,facets=Species~.,geom='histogram')
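As far as I know, qplot() is deprecated in recent ggplot2 releases, so the same faceted histogram can also be written with the full ggplot() syntax. This is only a sketch under assumptions: the toy `birds.by.r` data frame below stands in for the real subset, and `binwidth = 1` is my own choice (Oil_Cond looks like an integer code), not something either answer specifies.

```r
library(ggplot2)

# Toy stand-in for the 'birds.by.r' subset (hypothetical values).
birds.by.r <- data.frame(
  Species  = c("Laughing Gull", "Laughing Gull", "Brown Pelican", "Brown Pelican"),
  Oil_Cond = c(1, 2, 2, 1)
)

# One histogram panel per species, stacked vertically (Species ~ .).
p <- ggplot(birds.by.r, aes(x = Oil_Cond)) +
  geom_histogram(binwidth = 1) +  # binwidth is an assumption, adjust as needed
  facet_grid(Species ~ .)

print(p)
```

Swapping `birds.by.r` for `birds.by.n` gives the plot for every species above the count threshold instead.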