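I have a dataset containing the postcode and price of each house. I need to split it into three datasets based on the average price per postcode: for example, one set with the highest-priced postcodes, one with the average-priced postcodes, and one with the lowest-priced postcodes.
My idea was to sort the dataset by price from lowest to highest, split it into three parts, and then see where each postcode appears most often, but that feels very inefficient. Is there a better way?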
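Answer 0 (score: 1)
Here is a solution using dplyr. It is a bit verbose, but it gets the job done. With group_by you can compute the mean price for each postcode, which lets you divide the data more precisely into expensive, average, and cheap postcodes.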
library(dplyr)
# Generate sample data (the seed is fixed only so the example is reproducible)
set.seed(42)
dat <- tibble(
  postcode = sample(c("5432", "5654", "2342", "1231", "8543", "4324"), 1000, replace = TRUE),
  price = rnorm(1000, 400000, 50000)
)
# Work out mean price for each postcode
mean_prices <- dat %>%
  group_by(postcode) %>%
  summarise(mean_price = mean(price))
# Find split points for the mean postcode price (only the 1/3 and 2/3 quantiles are used below)
split_points <- quantile(unique(mean_prices$mean_price), probs = c(1/3, 2/3))
# Get the postcodes that are within cheap, middle, or expensive price ranges
cheap_postcodes <- mean_prices %>%
  filter(mean_price <= split_points[1]) %>%
  pull(postcode)
middle_postcodes <- mean_prices %>%
  filter(mean_price > split_points[1] & mean_price <= split_points[2]) %>%
  pull(postcode)
expensive_postcodes <- mean_prices %>%
  filter(mean_price > split_points[2]) %>%
  pull(postcode)
# Create the three datasets
cheap_third <- dat %>% filter(postcode %in% cheap_postcodes)
middle_third <- dat %>% filter(postcode %in% middle_postcodes)
expensive_third <- dat %>% filter(postcode %in% expensive_postcodes)
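If the verbosity is a concern, the same idea can probably be written more compactly. The sketch below is not part of the answer above; it swaps the quantile-based filters for dplyr's ntile(), which ranks the postcode means and assigns them to three equal-sized buckets. The names tiers, labelled, and by_tier are made up for illustration, and the sample data is the same hypothetical data as before. Note that ntile() buckets by rank, so with tied means the groups may differ slightly from the quantile version.

```r
library(dplyr)

# Hypothetical sample data, same shape as in the answer above
set.seed(42)
dat <- tibble(
  postcode = sample(c("5432", "5654", "2342", "1231", "8543", "4324"), 1000, replace = TRUE),
  price = rnorm(1000, 400000, 50000)
)

# Mean price per postcode, then a tier from 1 (cheapest) to 3 (most expensive)
tiers <- dat %>%
  group_by(postcode) %>%
  summarise(mean_price = mean(price)) %>%
  mutate(tier = ntile(mean_price, 3))

# Attach each postcode's tier to every house, then split into three data frames
labelled <- left_join(dat, tiers, by = "postcode")
by_tier <- split(labelled, labelled$tier)

# Assumes all three tiers are present, which holds for the six postcodes here
names(by_tier) <- c("cheap", "middle", "expensive")
```

The three datasets are then available as by_tier$cheap, by_tier$middle, and by_tier$expensive. Keeping them in one list makes it easy to apply the same step to each with lapply().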