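I'm new to R and am having trouble replicating a basic left join and update (something I would normally do in SQL). I've checked several earlier Stack Overflow questions but still can't get this code quite right.
I've been trying to build a data.frame, starting from one that simply lists all possible ZIP codes. I also have several other data.frames, each of which counts buildings by year built within a range (e.g. 1990-1999), grouped by ZIP code. Note that each of these range data.frames contains only a subset of the ZIP codes in the first one. Essentially, I want to build a table that starts from the data.frame of all possible ZIP codes and joins each individual range data.frame onto it, so that the final table shows every range for every ZIP code. Each range data.frame needs to line up with the "ZIPS_ALL" variable. The 1990-1999, 2000-2009, and zip_codes_all data frames look like this: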
1990-1999        2000-2009        zip_codes_all
ZIP    Count     ZIP    Count     ZIPS_ALL
19145  1         19145  1         19145
19146  2         19147  3         19146
19147  2         19148  1         19147
                                  19148
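I've tried several different left_join calls from dplyr and merge from base R, but each time I attach a range it overwrites the previous one, so my final table contains only the ZIP codes and the last range. I need to keep every range, so that the final table shows all ZIP codes from zip_codes_all, aligned to the ZIPS_ALL variable.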
# Names that start with a digit or contain "-" must be wrapped in backticks in R
`1990_1999_df` <- left_join(x = zip_codes_all, y = `1990-1999`, by = c("ZIP_ALL" = "ZIP"))
`2000_2009_df` <- left_join(x = zip_codes_all, y = `2000-2009`, by = c("ZIP_ALL" = "ZIP"))
The expected result aligns every range data.frame with the data.frame of all possible ZIP codes, with missing entries shown as NA; see below:
1990-1999  2000-2009  zip_codes_all
Count      Count      ZIPS_ALL
1          1          19145
2          NA         19146
2          1          19147
NA         1          19148
The dput output for my zip_codes_all variable is:
dput(droplevels(zip_codes_all[1:10,]))
structure(list(ZIP_ALL = c(23115L, 22960L, 22578L, 23936L, 23308L,
23875L, 23518L, 23139L, 23917L, 22967L)), row.names = c(NA, -10L
), .internal.selfref = <pointer: 0x0000000000201ef0>, class =
c("data.table",
"data.frame"))
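Update: here is my code with the actual variable names. This code works, but I'd like to know whether there is a more efficient way that doesn't require adding each range by hand, because I need to build many ranges.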
#create your range counts by group
nn_data_1939_range <- nn_data[yearbuilt <= 1939 ,.N, by = ZIP][order(ZIP)]
nn_data_1949_range <- nn_data[yearbuilt >= 1940 & yearbuilt <= 1949 ,.N, by = ZIP][order(ZIP)]
nn_data_1959_range <- nn_data[yearbuilt >= 1950 & yearbuilt <= 1959 ,.N, by = ZIP][order(ZIP)]
nn_data_1969_range <- nn_data[yearbuilt >= 1960 & yearbuilt <= 1969 ,.N, by = ZIP][order(ZIP)]
nn_data_1979_range <- nn_data[yearbuilt >= 1970 & yearbuilt <= 1979 ,.N, by = ZIP][order(ZIP)]
nn_data_1989_range <- nn_data[yearbuilt >= 1980 & yearbuilt <= 1989 ,.N, by = ZIP][order(ZIP)]
nn_data_1999_range <- nn_data[yearbuilt >= 1990 & yearbuilt <= 1999 ,.N, by = ZIP][order(ZIP)]
nn_data_2004_range <- nn_data[yearbuilt >= 2000 & yearbuilt <= 2004 ,.N, by = ZIP][order(ZIP)]
nn_data_2005_range <- nn_data[yearbuilt >= 2005,.N, by = ZIP][order(ZIP)]
#Build your table by each range; adding each range to the previously created data.frame; join zip_all to zip
tbl_LessThan_1939 <- left_join(x = zip_codes_all, y = nn_data_1939_range, by = c("ZIP_ALL" = "ZIP"))
tbl_0_1949 <- left_join(x = tbl_LessThan_1939, nn_data_1949_range, by = c("ZIP_ALL" = "ZIP"))
tbl_0_1959 <- left_join(x = tbl_0_1949, nn_data_1959_range, by = c("ZIP_ALL" = "ZIP"))
tbl_0_1969 <- left_join(x = tbl_0_1959, nn_data_1969_range, by = c("ZIP_ALL" = "ZIP"))
tbl_0_1979 <- left_join(x = tbl_0_1969, nn_data_1979_range, by = c("ZIP_ALL" = "ZIP"))
tbl_0_1989 <- left_join(x = tbl_0_1979, nn_data_1989_range, by = c("ZIP_ALL" = "ZIP"))
tbl_0_1999 <- left_join(x = tbl_0_1989, nn_data_1999_range, by = c("ZIP_ALL" = "ZIP"))
tbl_0_2004 <- left_join(x = tbl_0_1999, nn_data_2004_range, by = c("ZIP_ALL" = "ZIP"))
tbl_0_present <- left_join(x = tbl_0_2004, nn_data_2005_range, by = c("ZIP_ALL" = "ZIP"))
Answer 0 (score: 0)
OK, my best guess is that your data looks something like this (though probably larger):
library(data.table)
set.seed(47)
nn_data_sample = data.table(
yearbuilt = rep(c(1938, 1942, 1951, 1963), each = 4),
ZIP = sample(c(90210, 19145, 19146, 19147, 19148, 19149), size = 16, replace = TRUE)
)
nn_data_sample
# yearbuilt ZIP
# 1: 1938 19149
# 2: 1938 19146
# 3: 1938 19148
# 4: 1938 19148
# 5: 1942 19147
# 6: 1942 19148
# 7: 1942 19146
# 8: 1942 19146
# 9: 1951 19147
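This is nicely formatted data: long format, easy to work with. It looks like you want to (a) count rows by ZIP code and decade built (more or less; the granularity gets finer for recent years), and then (b) convert the long data (one ZIP column and one time column) to wide format, where the time periods are spread across many columns.
For (a), we'll use the cut function to bin the years into the decade-like intervals you want, then aggregate the rows by ZIP code and decade.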
decade_data = nn_data_sample[, decade_built := cut(
yearbuilt,
breaks = c(0, seq(1939, 1999, by = 10), 2004, Inf))
][, .(n = .N), by = .(decade_built, ZIP)]
decade_data
# decade_built ZIP n
# 1: (0,1939] 19149 1
# 2: (0,1939] 19146 1
# 3: (0,1939] 19148 2
# 4: (1939,1949] 19147 1
# 5: (1939,1949] 19148 1
# 6: (1939,1949] 19146 2
# 7: (1949,1959] 19147 1
# 8: (1949,1959] 19149 1
# ...
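This is a good format for many use cases, and you can work with it directly. data.table makes "by group" operations easy, so if you want to do more per-decade work, this should be your starting point. (Because we used :=, the decade_built column has become part of the original data, so you can inspect that column to verify that the binning worked as intended.)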
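One way to do that verification is to check which years land in each bin. The snippet below is a minimal sketch on hypothetical toy data (the toy table and the bin_check name are illustrative, not from the question). It reuses the same breaks and relies on cut's default right-closed intervals, so 1939 falls in (0,1939] and 1940 falls in (1939,1949].

```r
library(data.table)

# Hypothetical toy data: years chosen to sit on either side of several break points.
toy <- data.table(yearbuilt = c(1939, 1940, 1949, 1950, 1999, 2000, 2004, 2005))

# Same breaks as in the answer; cut() defaults to right-closed intervals (a, b].
toy[, decade_built := cut(
  yearbuilt,
  breaks = c(0, seq(1939, 1999, by = 10), 2004, Inf)
)]

# Smallest and largest year that landed in each bin.
bin_check <- toy[, .(min_year = min(yearbuilt), max_year = max(yearbuilt)),
                 by = decade_built]
bin_check
```

If the minimum and maximum year in each bin match the ranges you had in mind (for example 1940 to 1949), the breaks are set correctly.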
However, if you want to convert to wide format, dcast will do that for us:
dcast(decade_data, ZIP ~ decade_built, value.var = "n")
# ZIP (0,1939] (1939,1949] (1949,1959] (1959,1969]
# 1: 19146 1 2 NA NA
# 2: 19147 NA 1 1 2
# 3: 19148 2 1 1 NA
# 4: 19149 1 NA 1 1
# 5: 90210 NA NA 1 1
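Note that dcast only returns ZIP codes that appear in the data. If you also need a row for every ZIP code in zip_codes_all, as in the question, one option is a single left join of the wide table onto it. This is a sketch using small hypothetical stand-ins for zip_codes_all and decade_data; the all_zips_wide name is illustrative.

```r
library(data.table)

# Hypothetical stand-ins for the objects in the question and the answer.
zip_codes_all <- data.table(ZIP_ALL = c(19145L, 19146L, 19147L, 19148L))
decade_data <- data.table(
  decade_built = factor(c("(0,1939]", "(0,1939]", "(1939,1949]")),
  ZIP = c(19146L, 19148L, 19146L),
  n = c(1L, 2L, 2L)
)

# Long to wide: one column per decade bin.
wide <- dcast(decade_data, ZIP ~ decade_built, value.var = "n")

# Left join: all.x = TRUE keeps every row of zip_codes_all,
# and ZIP codes with no counts get NA in every decade column.
all_zips_wide <- merge(zip_codes_all, wide,
                       by.x = "ZIP_ALL", by.y = "ZIP", all.x = TRUE)
all_zips_wide
```

This replaces the chain of one left_join per range with one dcast and one merge, no matter how many ranges there are.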
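If you want to edit the column names, you can use the labels argument of the cut function to specify what you want from the start, or simply rename the columns at the end. Or do it in the middle, by modifying the values of the decade_built column after it is created. Do it wherever feels easiest.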
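As a sketch of the first option, the labels argument of cut takes one label per interval, that is, one fewer than the number of breaks. The toy data and the label names below are hypothetical; choose whatever names suit your table.

```r
library(data.table)

# Hypothetical toy data, not the asker's.
toy <- data.table(
  yearbuilt = c(1938, 1942, 1942, 2003, 2010),
  ZIP = c(19146, 19146, 19147, 19147, 19146)
)

breaks <- c(0, seq(1939, 1999, by = 10), 2004, Inf)
# Illustrative label names: 10 breaks define 9 intervals, so 9 labels.
decade_labels <- c("pre_1940", "1940_1949", "1950_1959", "1960_1969", "1970_1979",
                   "1980_1989", "1990_1999", "2000_2004", "2005_on")

toy[, decade_built := cut(yearbuilt, breaks = breaks, labels = decade_labels)]

# Count by bin and ZIP, then spread the bins across columns.
counts <- toy[, .(n = .N), by = .(decade_built, ZIP)]
wide <- dcast(counts, ZIP ~ decade_built, value.var = "n")
wide
```

The wide table then carries the readable labels as column names, so no renaming is needed afterwards. Only bins that actually occur in the data become columns here.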