Merging category code values into a dataset in R

Time: 2014-12-02 02:46:00

Tags: r merge

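I have a political donations dataset that stores industry categories as alphanumeric codes. A separate text document lists how these codes translate to category names, industry names, and sector names.

For example, "A1200" is the Sugar cane category of the Crop Production industry in the Agribusiness sector. I would like to know how to pair each alphanumeric code with its sector, industry, and category values in separate columns.

Currently, the code values dataset looks like this: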

    Catcode Catname     Catorder    Industry             Sector      
    A1200   Sugar cane  A01         Crop Production  Agribusiness

and the industry donation dataset looks like this:

    Business name    Amount donated    Year   Category
    Sarah Farms      1000              2010   A1200

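The category dataset has about 444 rows and the donation dataset about 1M rows. How can I fill in the donation dataset so that it looks like the table below? Category is the column the two datasets have in common.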

    Catcode Catname     Catorder    Industry             Sector          Business name    Amount donated    Year   Category
    A1200   Sugar cane  A01         Crop Production  Agribusiness     Sarah Farms      1000              2010   A1200

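I'm fairly new to these forums, so if there is a better way to ask this question, please let me know. Thanks for your help!

3 Answers:

Answer 0: (score: 2)

If speed matters, you may want to use data.table or dplyr. Here I have slightly modified your sample data to give you some ideas.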

df1 <- data.frame(Catcode = c("A1200", "B1500", "C1800"),
                  Catname = c("Sugar", "Salty", "Butter"),
                  Catorder = c("cane A01", "cane A01", "cane A01"),
                  Industry = c("Crop Production", "Crop Production", "Crop Production"),
                  Sector = c("Agribusiness", "Agribusiness", "Agribusiness"),
                  stringsAsFactors = FALSE)

#  Catcode Catname Catorder        Industry       Sector
#1   A1200   Sugar cane A01 Crop Production Agribusiness
#2   B1500   Salty cane A01 Crop Production Agribusiness
#3   C1800  Butter cane A01 Crop Production Agribusiness

df2 <- data.frame(BusinessName = c("Sarah Farms", "Ben Farms"),
                  AmountDonated = c(100, 200),
                  Year = c(2010, 2010),
                  Category = c("A1200", "B1500"),
                  stringsAsFactors = FALSE)

#  BusinessName AmountDonated Year Category
#1  Sarah Farms           100 2010    A1200
#2    Ben Farms           200 2010    B1500

library(dplyr)
library(data.table)

# 1) dplyr option
# Catcode C1800 will be dropped since it does not exist in both data frames.
inner_join(df1, df2, by = c("Catcode" = "Category"))

#      Catcode Catname Catorder        Industry       Sector BusinessName AmountDonated Year
#1   A1200   Sugar cane A01 Crop Production Agribusiness  Sarah Farms           100 2010
#2   B1500   Salty cane A01 Crop Production Agribusiness    Ben Farms           200 2010

# Catcode C1800 remains; its donation columns are filled with NA
left_join(df1, df2, by = c("Catcode" = "Category"))

#      Catcode Catname Catorder        Industry       Sector BusinessName AmountDonated Year
#1   A1200   Sugar cane A01 Crop Production Agribusiness  Sarah Farms           100 2010
#2   B1500   Salty cane A01 Crop Production Agribusiness    Ben Farms           200 2010
#3   C1800  Butter cane A01 Crop Production Agribusiness         <NA>            NA   NA

# 2) data.table option
# Convert data.frame to data.table
setDT(df1)
setDT(df2)

#Set columns for merge
setkey(df1, "Catcode")
setkey(df2, "Category")

df1[df2]

#   Catcode Catname Catorder        Industry       Sector BusinessName AmountDonated Year
#1:   A1200   Sugar cane A01 Crop Production Agribusiness  Sarah Farms           100 2010
#2:   B1500   Salty cane A01 Crop Production Agribusiness    Ben Farms           200 2010

df2[df1]
#   BusinessName AmountDonated Year Category Catname Catorder        Industry       Sector
#1:  Sarah Farms           100 2010    A1200   Sugar cane A01 Crop Production Agribusiness
#2:    Ben Farms           200 2010    B1500   Salty cane A01 Crop Production Agribusiness
#3:           NA            NA   NA    C1800  Butter cane A01 Crop Production Agribusiness
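
For the question's layout (about 1M donation rows, 444 codes), one possible variation is a data.table update join, which adds the lookup columns to the donation table by reference instead of building a new table. This is only a sketch: the table names `categories` and `donations`, the sample values, and the unmatched code "Z9999" are made up for illustration.

```r
library(data.table)

# Hypothetical toy tables mirroring the question's layout
categories <- data.table(
  Catcode  = c("A1200", "B1500", "C1800"),
  Catname  = c("Sugar cane", "Salty", "Butter"),
  Industry = c("Crop Production", "Crop Production", "Crop Production"),
  Sector   = c("Agribusiness", "Agribusiness", "Agribusiness")
)
donations <- data.table(
  BusinessName  = c("Sarah Farms", "Ben Farms", "Unknown Co"),
  AmountDonated = c(1000, 200, 50),
  Year          = c(2010, 2010, 2011),
  Category      = c("A1200", "B1500", "Z9999")  # Z9999 has no match
)

# Update join: for each donation row whose Category matches a Catcode,
# copy the lookup columns into donations in place (i. refers to categories).
donations[categories, on = .(Category = Catcode),
          `:=`(Catname = i.Catname, Industry = i.Industry, Sector = i.Sector)]

print(donations)
```

Every donation row is kept, and rows whose code is missing from the lookup table get NA in the new columns.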

Answer 1: (score: 0)

I think you are asking how to write the query, aren't you?

SELECT *
FROM code_values a              -- placeholder: your code values table
LEFT JOIN industry_donations b  -- placeholder: your industry donation table
  ON a.Catcode = b.Category
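
A LEFT JOIN keeps every row of the table on its left. If the goal is to keep every donation, the donation table probably belongs on the left. A rough base R equivalent, assuming hypothetical data frames `codes` and `donations` with made-up values, might look like this:

```r
# Hypothetical lookup table of category codes
codes <- data.frame(
  Catcode = c("A1200", "B1500"),
  Catname = c("Sugar cane", "Example category"),
  stringsAsFactors = FALSE
)

# Hypothetical donations; Z9999 has no entry in the lookup table
donations <- data.frame(
  BusinessName = c("Sarah Farms", "Ben Farms", "Unknown Co"),
  Amount       = c(1000, 200, 50),
  Category     = c("A1200", "A1200", "Z9999"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of the first (left) data frame,
# like LEFT JOIN with donations on the left side.
joined <- merge(donations, codes,
                by.x = "Category", by.y = "Catcode",
                all.x = TRUE)

print(joined)
```

Note that `merge` sorts the result by the key column by default, so the row order can differ from the input.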

Answer 2: (score: 0)

As krlmlr said:

> merge(df1, df2, by.x = "Catcode", by.y = "Category", all = TRUE)
  Catcode    Catname Catorder        Industry       Sector Business_name Amount_donated Year
1   A1200 Sugar_cane      A01 Crop_Production Agribusiness   Sarah_Farms           1000 2010

But you should avoid spaces in column names and values. I replaced them with _.
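
One way to do that renaming, sketched here on a hypothetical one-row data frame built from the question's column names, is to run `gsub` over `names()`:

```r
# check.names = FALSE keeps the spaces so there is something to clean up
donations <- data.frame(
  "Business name"  = "Sarah Farms",
  "Amount donated" = 1000,
  Year             = 2010,
  Category         = "A1200",
  check.names      = FALSE,
  stringsAsFactors = FALSE
)

# Replace every literal space in the column names with an underscore
names(donations) <- gsub(" ", "_", names(donations), fixed = TRUE)

print(names(donations))
```

The same `gsub` call can be applied to a character column if the values also need underscores.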