How to get the most frequent distinct values from a dataset

Asked: 2019-07-31 16:57:47

Tags: r

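I am working with LAPD arrest data obtained through the Mayor's Office website. For 2017 through 2018, I am trying to look at the charges in City Council District 5 and the count of each specific charge. CHARGE and CITY_COUNCIL_DIST are the two variables/columns I am looking at.

I used table(ArrestData$CHARGE) to count the occurrences of each distinct value.

I realized there are more than 2,400 unique entries, so most of them are omitted from the output. I would like to know whether there is code to see the 5 charges the LAPD gives most often.

In addition, I am trying to find the top 5 charges within one specific Council District (which is another variable/column). Is there code for that?

Aside: How do I add sample data to my post? What steps do I take in RStudio? Someone asked me to do this on a previous post, but I'm not sure how. They told me to use dput(head(df,n)), but even with 10 rows my data is too large. They told me to do it through an RScript, but I'm not sure what they meant.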

2 Answers:

Answer 0: (score: 0)

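I think the aggregate function might help. If your data consists of just CHARGE and CITY_COUNCIL_DIST, the code might look something like this:

# n is a helper column of 1s, so summing it counts the rows in each group
agg.data <- aggregate(n ~ CITY_COUNCIL_DIST + CHARGE, data = transform(ArrestData, n = 1), FUN = sum)

I'm not very advanced in R yet, so the code may need some adjusting for your actual data. Once you have the aggregate, you can order it by count, largest first:

agg.data[order(agg.data$n, decreasing = TRUE), ]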
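For the first part of the question (the 5 most common charges overall), a minimal base R sketch is to sort the output of table() and keep the head. The toy data frame below is made up for illustration and only stands in for the real ArrestData; the same two lines can then be applied to a subset for a single district.

```r
# Toy stand-in for the real arrest data (values are made up for illustration)
ArrestData <- data.frame(
  CHARGE = rep(c("A", "B", "C", "D", "E", "F"), times = c(12, 10, 8, 6, 4, 2)),
  CITY_COUNCIL_DIST = c(0, 5)  # recycled, so rows alternate between districts 0 and 5
)

# Frequency of each charge, most common first, keeping the top 5
top5_all <- head(sort(table(ArrestData$CHARGE), decreasing = TRUE), 5)

# Same idea restricted to one council district
dist5 <- ArrestData[ArrestData$CITY_COUNCIL_DIST == 5, ]
top5_dist5 <- head(sort(table(dist5$CHARGE), decreasing = TRUE), 5)

print(top5_all)
print(top5_dist5)
```

Because only the first five entries of the sorted table are printed, the 2,400+ distinct values no longer flood the console.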

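I really can't help with dput, sorry!

Answer 1: (score: 0)

Posting a reference to the actual dataset or to sample data would help in building a solution. It would also help the post meet the reproducibility standard that others have mentioned. For the sake of this example, we will create a dataset explicitly.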

ArrestData <- data.frame(
  CHARGE=c("CHARGEA","CHARGEA","CHARGEA","CHARGEA","CHARGEA","CHARGEA","CHARGEA","CHARGEA","CHARGEA",
           "CHARGEA","CHARGEA","CHARGEA","CHARGEA","CHARGEA","CHARGEA","CHARGEA","CHARGEA","CHARGEA",
           "CHARGEB","CHARGEB","CHARGEB","CHARGEB","CHARGEB","CHARGEB","CHARGEB","CHARGEB",
           "CHARGEB","CHARGEB","CHARGEB","CHARGEB","CHARGEB","CHARGEB","CHARGEB","CHARGEB",
           "CHARGEC","CHARGEC","CHARGEC","CHARGEC","CHARGEC","CHARGEC","CHARGEC",
           "CHARGEC","CHARGEC","CHARGEC","CHARGEC","CHARGEC","CHARGEC","CHARGEC",
           "CHARGED","CHARGED","CHARGED","CHARGED","CHARGED","CHARGED",
           "CHARGED","CHARGED","CHARGED","CHARGED","CHARGED","CHARGED",
           "CHARGEE","CHARGEE","CHARGEE","CHARGEE","CHARGEE",
           "CHARGEE","CHARGEE","CHARGEE","CHARGEE","CHARGEE",
           "CHARGEF","CHARGEF","CHARGEF","CHARGEF",
           "CHARGEF","CHARGEF","CHARGEF","CHARGEF",
           "CHARGEG","CHARGEG","CHARGEG",           
           "CHARGEG","CHARGEG","CHARGEG",
           "CHARGEH","CHARGEH",
           "CHARGEH","CHARGEH",
           "CHARGEI",
           "CHARGEI"
           ),
  CITY_COUNCIL_DIST=c(0,5)
)
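On sharing sample data: a minimal sketch of the dput(head(df, n)) approach mentioned in the question, assuming the output is too large mainly because the data frame has many columns. The data frame df and its OTHER_COLUMN are hypothetical stand-ins; the idea is to select only the columns the question is about before calling dput().

```r
# Hypothetical stand-in for a wide dataset; OTHER_COLUMN represents columns that are not needed
df <- data.frame(
  CHARGE = rep(c("CHARGEA", "CHARGEB"), times = c(6, 4)),
  CITY_COUNCIL_DIST = c(0, 5),
  OTHER_COLUMN = seq_len(10),
  stringsAsFactors = FALSE
)

# Keep only the relevant columns, then a handful of rows
sample_rows <- head(df[, c("CHARGE", "CITY_COUNCIL_DIST")], 10)

# dput() prints R code that rebuilds the object; that printed text is what goes into the post
dput(sample_rows)
```

Running these lines from a script or from the console gives the same printed output, which can be pasted into the post as is.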

Assuming your dataset is named ArrestData and your CHARGE / CITY_COUNCIL_DIST columns are named as described, this code should work. The following code returns the top 5 CHARGE values for every CITY_COUNCIL_DIST.

#install these packages if you do not have them

install.packages("magrittr")
install.packages("dplyr")

#make sure these libraries are present
library(magrittr)
library(dplyr)

ArrestData %>% 
  group_by(CHARGE, CITY_COUNCIL_DIST) %>%
  summarize(count=n()) %>% 
  arrange(CITY_COUNCIL_DIST, desc(count)) %>%
  group_by(CITY_COUNCIL_DIST) %>% 
  mutate(rank = rank(desc(count), ties.method="min")) %>% 
  filter(rank<=5)

To keep only the results for CITY_COUNCIL_DIST 5, change the filter statement to the following (depending on how district 5 is actually coded in your CITY_COUNCIL_DIST column):

filter(rank<=5, CITY_COUNCIL_DIST==5)
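As a shorter variant for a single district, here is a sketch that filters first and then counts, using dplyr's count() and slice_max() (slice_max() assumes dplyr 1.0.0 or later). The toy data is made up for illustration, and this is an alternative to the rank-based pipeline rather than the answer's own method.

```r
library(dplyr)

# Toy stand-in for the real arrest data (values are made up for illustration)
ArrestData <- data.frame(
  CHARGE = rep(paste0("CHARGE", LETTERS[1:6]), times = c(12, 10, 8, 6, 4, 2)),
  CITY_COUNCIL_DIST = c(0, 5)  # recycled, so rows alternate between districts 0 and 5
)

top5_dist5 <- ArrestData %>%
  filter(CITY_COUNCIL_DIST == 5) %>%          # keep one district first
  count(CHARGE, name = "count") %>%           # one row per charge with its frequency
  slice_max(count, n = 5, with_ties = TRUE)   # top 5 counts; ties at the cutoff are kept

print(top5_dist5)
```

With with_ties = TRUE the result can contain more than 5 rows when several charges share the fifth-largest count, which mirrors the behavior of ties.method="min" in the ranked version.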