我有一个政治捐赠数据集,其中包含字母数字代码中的行业类别。单独的文本文档列出了这些字母数字代码如何转换为行业名称,行业名称和行业类别名称。
例如," A1200",是农业企业部门的甘蔗产业类作物。我想知道如何在单独的列中将字母数字代码与其各自的扇区,行业和类别值配对。
目前,代码值数据集位于
中 Catcode Catname Catorder Industry Sector
A1200 Sugar cane A01 Crop Production Agribusiness
以及此行业捐赠数据集:
Business name Amount donated Year Category
Sarah Farms 1000 2010 A1200
类别数据集约为444行,捐赠集约为1M行。我如何感受捐赠数据集,因此它看起来像这样。类别将是通用名称
Catcode Catname Catorder Industry Sector Business name Amount donated Year Category
A1200 Sugar cane A01 Crop Production Agribusiness Sarah Farms 1000 2010 A1200
我对这些论坛有点新意,所以如果有更好的方式来提出这个问题,请告诉我。谢谢你的帮助!
答案 0 :(得分:2)
如果速度很重要,您可能需要使用data.table
或dplyr
。在这里,我稍微修改了您的示例数据以提供一些想法。
df1 <- data.frame(Catcode = c("A1200", "B1500", "C1800"),
Catname = c("Sugar", "Salty", "Butter"),
Catorder = c("cane A01", "cane A01", "cane A01"),
Industry = c("Crop Production", "Crop Production", "Crop Production"),
Sector = c("Agribusiness", "Agribusiness", "Agribusiness"),
stringsAsFactors = FALSE)
# Catcode Catname Catorder Industry Sector
#1 A1200 Sugar cane A01 Crop Production Agribusiness
#2 B1500 Salty cane A01 Crop Production Agribusiness
#3 C1800 Butter cane A01 Crop Production Agribusiness
df2 <- data.frame(BusinessName = c("Sarah Farms", "Ben Farms"),
AmountDonated = c(100, 200),
Year = c(2010, 2010),
Category = c("A1200", "B1500"),
stringsAsFactors = FALSE)
# BusinessName AmountDonated Year Category
#1 Sarah Farms 100 2010 A1200
#2 Ben Farms 200 2010 B1500
library(dplyr)
library(data.table)
# 1) dplyr option
# Catcode C1800 will be dropped since it does not exist in both data frames.
inner_join(df1, df2, by = c("Catcode" = "Category"))
# Catcode Catname Catorder Industry Sector BusinessName AmountDonated Year
#1 A1200 Sugar cane A01 Crop Production Agribusiness Sarah Farms 100 2010
#2 B1500 Salty cane A01 Crop Production Agribusiness Ben Farms 200 2010
# Catcide C1800 remains
left_join(df1, df2, by = c("Catcode" = "Category"))
# Catcode Catname Catorder Industry Sector BusinessName AmountDonated Year
#1 A1200 Sugar cane A01 Crop Production Agribusiness Sarah Farms 100 2010
#2 B1500 Salty cane A01 Crop Production Agribusiness Ben Farms 200 2010
#3 C1800 Butter cane A01 Crop Production Agribusiness <NA> NA NA
# 2) data.table option
# Convert data.frame to data.table
setDT(df1)
setDT(df2)
#Set columns for merge
setkey(df1, "Catcode")
setkey(df2, "Category")
df1[df2]
# Catcode Catname Catorder Industry Sector BusinessName AmountDonated Year
#1: A1200 Sugar cane A01 Crop Production Agribusiness Sarah Farms 100 2010
#2: B1500 Salty cane A01 Crop Production Agribusiness Ben Farms 200 2010
df2[df1]
# BusinessName AmountDonated Year Category Catname Catorder Industry Sector
#1: Sarah Farms 100 2010 A1200 Sugar cane A01 Crop Production Agribusiness
#2: Ben Farms 200 2010 B1500 Salty cane A01 Crop Production Agribusiness
#3: NA NA NA C1800 Butter cane A01 Crop Production Agribusiness
答案 1 :(得分:0)
我想你问的是如何查询..不是吗?
SELECT *
FROM
code values dataset(your table for this) a
LEFT JOIN industry donation dataset(your table for this) b
ON a.CatCode = b.Category
答案 2 :(得分:0)
正如krlmlr所说:
> merge(df1, df2, by.x = "Catcode", by.y = "Category", all = T)
Catcode Catname Catorder Industry Sector Business_name Amount_donated Year
1 A1200 Sugar_cane A01 Crop_Production Agribusiness Sarah_Farms 1000 2010
但是你应该避免列名和值中的空格。我用_