How do I add additional columns to an existing data.frame, aligned to a specific column that already exists in it?

Asked: 2019-04-16 19:08:30

Tags: r data.table

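I am new to R and am having trouble replicating a basic left join with an update (something I would normally do in SQL). I have gone through several previously asked questions on Stack Overflow, but I still cannot get this code quite right.
I have been trying to build a data.frame, starting from a data.frame that simply lists all possible ZIP codes. I also have several other data.frames, each of which counts buildings whose year built falls within a given range (e.g. 1990-1999), grouped by ZIP code. Note that each of these data.frames contains only a subset of the ZIP codes in the first one. Essentially, I want to build a table that starts from the data.frame of all possible ZIP codes and joins each range data.frame to it, so that the final table shows the counts for every range across all ZIP codes. Each range data.frame needs to be aligned to the "ZIPS_ALL" variable. The 1990-1999, 2000-2009 and zip_codes_all data.frames look like this: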

    1990-1999           2000-2009         zip_codes_all
    ZIP     Count       ZIP     Count     ZIPS_ALL
    19145     1         19145     1       19145
    19146     2         19147     3       19146
    19147     2         19148     1       19147 
                                          19148

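I tried several variations of left_join from dplyr and merge from base R, but when I try to attach each range it overwrites the previous one, so my final table contains only the ZIP codes and the last range. I need to keep all of the ranges, so that the final table shows every ZIP code from zip_codes_all, aligned to the ZIPS_ALL variable.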

    # names that start with a digit or contain "-" must be backquoted in R
    `1990_1999_df` <- left_join(x = zip_codes_all, y = `1990-1999`,
                                by = c("ZIP_ALL" = "ZIP"))
    `2000_2009_df` <- left_join(x = zip_codes_all, y = `2000-2009`,
                                by = c("ZIP_ALL" = "ZIP"))

The expected result aligns every range data.frame to the data.frame of all possible ZIP codes, with missing entries simply left as NA; see below:

    1990-1999   2000-2009   zip_codes_all
    Count       Count       ZIPS_ALL
    1           1           19145
    2           NA          19146
    2           3           19147
    NA          1           19148

The dput output for my zip_codes_all variable is:

    dput(droplevels(zip_codes_all[1:10,]))
    structure(list(ZIP_ALL = c(23115L, 22960L, 22578L, 23936L, 23308L,
    23875L, 23518L, 23139L, 23917L, 22967L)), row.names = c(NA, -10L
    ), .internal.selfref = <pointer: 0x0000000000201ef0>, class =
    c("data.table", "data.frame"))

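Here is my updated code with the actual variable names. This code works, but I would like to know whether there is a more efficient approach that does not require adding each range by hand, because I need to build a lot of ranges.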

    # create your range counts by group
    nn_data_1939_range <- nn_data[yearbuilt <= 1939, .N, by = ZIP][order(ZIP)]
    nn_data_1949_range <- nn_data[yearbuilt >= 1940 & yearbuilt <= 1949, .N, by = ZIP][order(ZIP)]
    nn_data_1959_range <- nn_data[yearbuilt >= 1950 & yearbuilt <= 1959, .N, by = ZIP][order(ZIP)]
    nn_data_1969_range <- nn_data[yearbuilt >= 1960 & yearbuilt <= 1969, .N, by = ZIP][order(ZIP)]
    nn_data_1979_range <- nn_data[yearbuilt >= 1970 & yearbuilt <= 1979, .N, by = ZIP][order(ZIP)]
    nn_data_1989_range <- nn_data[yearbuilt >= 1980 & yearbuilt <= 1989, .N, by = ZIP][order(ZIP)]
    nn_data_1999_range <- nn_data[yearbuilt >= 1990 & yearbuilt <= 1999, .N, by = ZIP][order(ZIP)]
    nn_data_2004_range <- nn_data[yearbuilt >= 2000 & yearbuilt <= 2004, .N, by = ZIP][order(ZIP)]
    nn_data_2005_range <- nn_data[yearbuilt >= 2005, .N, by = ZIP][order(ZIP)]

    # Build your table by each range; adding each range to the previously created data.frame; join zip_all to zip
    tbl_LessThan_1939 <- left_join(x = zip_codes_all, y = nn_data_1939_range, by = c("ZIP_ALL" = "ZIP"))
    tbl_0_1949 <- left_join(x = tbl_LessThan_1939, nn_data_1949_range, by = c("ZIP_ALL" = "ZIP"))
    tbl_0_1959 <- left_join(x = tbl_0_1949, nn_data_1959_range, by = c("ZIP_ALL" = "ZIP"))
    tbl_0_1969 <- left_join(x = tbl_0_1959, nn_data_1969_range, by = c("ZIP_ALL" = "ZIP"))
    tbl_0_1979 <- left_join(x = tbl_0_1969, nn_data_1979_range, by = c("ZIP_ALL" = "ZIP"))
    tbl_0_1989 <- left_join(x = tbl_0_1979, nn_data_1989_range, by = c("ZIP_ALL" = "ZIP"))
    tbl_0_1999 <- left_join(x = tbl_0_1989, nn_data_1999_range, by = c("ZIP_ALL" = "ZIP"))
    tbl_0_2004 <- left_join(x = tbl_0_1999, nn_data_2004_range, by = c("ZIP_ALL" = "ZIP"))
    tbl_0_present <- left_join(x = tbl_0_2004, nn_data_2005_range, by = c("ZIP_ALL" = "ZIP"))

1 Answer:

Answer 0 (score: 0)

OK, my best guess is that your data looks something like this (though probably much larger):

    library(data.table)
    set.seed(47)
    nn_data_sample = data.table(
      yearbuilt = rep(c(1938, 1942, 1951, 1963), each = 4),
      ZIP = sample(c(90210, 19145, 19146, 19147, 19148, 19149), size = 16, replace = TRUE)
    )
    nn_data_sample
    #    yearbuilt   ZIP
    # 1:      1938 19149
    # 2:      1938 19146
    # 3:      1938 19148
    # 4:      1938 19148
    # 5:      1942 19147
    # 6:      1942 19148
    # 7:      1942 19146
    # 8:      1942 19146
    # 9:      1951 19147
    # ...

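This is nicely formatted data: long format, easy to work with. It looks like you want to (a) count rows by ZIP code and decade built (more or less; the granularity gets finer for recent years), and then (b) convert the long data (one ZIP code column and one time column) to a wide format in which time is spread across many columns.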

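For (a), we will use the cut function to divide the years into the decade-like intervals you want, and then aggregate the rows by ZIP code and decade.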

    decade_data = nn_data_sample[, decade_built := cut(
      yearbuilt,
      breaks = c(0, seq(1939, 1999, by = 10), 2004, Inf))
    ][, .(n = .N), by = .(decade_built, ZIP)]

    decade_data
    #    decade_built   ZIP n
    # 1:     (0,1939] 19149 1
    # 2:     (0,1939] 19146 1
    # 3:     (0,1939] 19148 2
    # 4:  (1939,1949] 19147 1
    # 5:  (1939,1949] 19148 1
    # 6:  (1939,1949] 19146 2
    # 7:  (1949,1959] 19147 1
    # 8:  (1949,1959] 19149 1
    # ...

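This is a good format for many use cases. data.table makes "by group" operations easy, so if you want to do more with each decade, this should be your starting point. (Because we used :=, the decade_built column is now part of the original data, so you can look at it to verify that the binning worked correctly.)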
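One way to check the binning is to run cut on a few years that sit on either side of each break. The sketch below uses hypothetical boundary years (they are not from the question's data) and relies on cut's default of right-closed intervals, so 1939 lands in the first bin and 1940 in the second, which should match the yearbuilt <= 1939 and yearbuilt >= 1940 filters in the question.

```r
# Hypothetical years chosen to straddle the breaks used above
years <- c(1939, 1940, 1949, 1950, 1999, 2000, 2004, 2005)
breaks <- c(0, seq(1939, 1999, by = 10), 2004, Inf)

# Default right = TRUE: each interval includes its upper bound
bins <- cut(years, breaks = breaks)

# Show each year next to the interval it was assigned to
data.frame(year = years, bin = bins, bin_index = as.integer(bins))
```

If a year ended up in an unexpected bin here, the breaks vector would be the first thing to adjust.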

However, if you want to change to wide format, dcast will do that for us:

    dcast(decade_data, ZIP ~ decade_built, value.var = "n")
    #      ZIP (0,1939] (1939,1949] (1949,1959] (1959,1969]
    # 1: 19146        1           2          NA          NA
    # 2: 19147       NA           1           1           2
    # 3: 19148        2           1           1          NA
    # 4: 19149        1          NA           1           1
    # 5: 90210       NA          NA           1           1
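The wide table only contains ZIP codes that actually occur in the data. If the result also has to line up with the full ZIP list from the question, one option is to join the wide table onto zip_codes_all. The following is a minimal sketch with small made-up stand-ins for zip_codes_all and decade_data; it assumes the join column is called ZIP_ALL, as in the dput output.

```r
library(data.table)

# Hypothetical stand-ins for the objects discussed above
zip_codes_all <- data.table(ZIP_ALL = c(19145L, 19146L, 19147L, 19148L))
decade_data <- data.table(
  decade_built = factor(c("(0,1939]", "(0,1939]", "(1939,1949]")),
  ZIP = c(19146L, 19148L, 19146L),
  n = c(1L, 2L, 2L)
)

# Long to wide: one column per decade
wide <- dcast(decade_data, ZIP ~ decade_built, value.var = "n")

# data.table join x[i, on = ...]: keeps every row of i (zip_codes_all);
# ZIP codes with no counts get NA in every decade column
aligned <- wide[zip_codes_all, on = c(ZIP = "ZIP_ALL")]
aligned
```

The rows of the result follow the order of zip_codes_all, and the join column keeps the name ZIP from the wide table.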

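If you want to edit the column names, you can specify what you want up front using the labels argument of the cut function, or you can simply rename the columns at the end. Or do it in the middle, modifying the values of the decade_built column after creating it. Do it wherever feels easiest.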
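As an illustration of the first option, here is a sketch that passes labels to cut. The label text and the small built table are made up for the example; the only real constraint is that there must be one label per interval, i.e. length(breaks) - 1 of them.

```r
library(data.table)

breaks <- c(0, seq(1939, 1999, by = 10), 2004, Inf)

# Hypothetical label text: one label per interval (9 intervals here)
decade_labels <- c("pre_1940", "1940_1949", "1950_1959", "1960_1969",
                   "1970_1979", "1980_1989", "1990_1999", "2000_2004",
                   "2005_on")

# Made-up years, just to show the labels being applied
built <- data.table(yearbuilt = c(1938, 1942, 1951, 2003, 2010))
built[, decade_built := cut(yearbuilt, breaks = breaks, labels = decade_labels)]
built
```

After dcast, these labels become the column names of the wide table. Renaming at the end instead would probably be a call to data.table's setnames(x, old, new) on the wide result.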