R: Creating & assigning duplicate records

Time: 2013-08-03 06:14:50

Tags: r duplicates uniqueidentifier

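I have a series of media sources to which I must assign county names. For sources that map to a single county (e.g. local newspapers), this is simple - I created a county name variable with a switch function that assigns the county name based on the source name. Sample: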

switchfun <- function(x) {
  switch(x, 'Morning Call' = 'Lehigh', 'Inquirer' = 'Philadelphia',
         'Daily Ledger' = 'Mercer', 'Null')
}

County.Name <- as.character(lapply(Source, switchfun))

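But I also have sources (NPR, AP, etc.) that I want to assign to all counties in my dataset. This essentially means duplicating any record whose source is "national" and assigning one copy of it to every county in my dataset.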

dput of the current file layout:

structure(list(Source = structure(c(5L, 2L, 4L, 3L, 7L, 1L, 6L
), .Label = c("Associated Press", "Daily Ledger", "Herald Tribune", 
"Inquirer", "Morning Call", "NPR", "Yahoo News"), class = "factor"), 
County = structure(c(1L, 2L, 4L, 3L, NA, NA, NA), .Label = c("Lehigh", 
"Mercer", "Montgomery", "Philadelphia"), class = "factor"), 
Score = c(3L, 10L, 4L, 8L, 1L, 3L, 6L)), .Names = c("Source", 
"County", "Score"), class = "data.frame", row.names = c(NA, -7L
))

In the current file, NPR, Associated Press, & Yahoo News have no associated county ("NA").

dput of the desired file layout:

structure(list(Source = structure(c(5L, 2L, 4L, 3L, 7L, 7L, 7L, 
7L, 1L, 1L, 1L, 1L, 6L, 6L, 6L, 6L), .Label = c("Associated Press", 
"Daily Ledger", "Herald Tribune", "Inquirer", "Morning Call", 
"NPR", "Yahoo News"), class = "factor"), County = structure(c(1L, 
2L, 4L, 3L, 1L, 2L, 4L, 3L, 1L, 2L, 4L, 3L, 1L, 2L, 4L, 3L), .Label = c("Lehigh", 
"Mercer", "Montgomery", "Philadelphia"), class = "factor"), Score = c(3L, 
10L, 4L, 8L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 6L, 6L, 6L, 6L)), .Names = c("Source", 
"County", "Score"), class = "data.frame", row.names = c(NA, -16L
))

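In the desired layout, I have assigned each national source and its score to each of the four counties in the dataset. For example, Yahoo News & its score of 1 have been replicated 4 times, once each for Lehigh, Philadelphia, Montgomery, & Mercer counties. The record where Yahoo News had "NA" for County is gone. In my actual dataset I have roughly 100 counties, so Yahoo News & its associated variables (e.g. score, date, author, etc. - I have about 60 variables in total) would be replicated 100 times. I would also like the counties of these new "duplicate" records to be assigned to the County.Name variable that I created with the switch function above. I don't want 2 county name fields; I want all of these newly created counties in County.Name.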

1 answer:

Answer 0: (score: 1)

If I understand you correctly, this might be one possibility:

# a (minimal) data frame with all unique source-county combinations
src_cnt <- data.frame(source = c("Morning Call", "AP", "AP", "AP"), county = c("Lehigh", "Lehigh", "Mercer", "Phila"))

# a data frame with a unique score for each source
src_score <- data.frame(source = c("Morning Call", "AP"), score = c(10, 3))

merge(src_cnt, src_score)
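
The merge above is a many-to-one join on the shared `source` column: every source-county row picks up that source's single score. As a rough sketch of how one might check that no source is left without a score, the variant below adds `all.x = TRUE` so unmatched rows are kept with `NA`. The extra "NPR" row and the name `res` are made up for illustration and are not part of the question's data.

```r
# Source-county combinations; "NPR" is a hypothetical source with no score
src_cnt <- data.frame(
  source = c("Morning Call", "AP", "AP", "AP", "NPR"),
  county = c("Lehigh", "Lehigh", "Mercer", "Phila", "Lehigh"),
  stringsAsFactors = FALSE
)

# One score per source (no entry for "NPR")
src_score <- data.frame(
  source = c("Morning Call", "AP"),
  score = c(10, 3),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of src_cnt; sources without a score get NA
res <- merge(src_cnt, src_score, by = "source", all.x = TRUE)

# Sources that still lack a score after the join
unique(res$source[is.na(res$score)])
```

Without `all.x = TRUE`, `merge` silently drops rows that have no match, so a missing score would show up only as a shorter result.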
Edit following the updated question

# Assuming your current data is named dd
# select the national sources, i.e. the sources where County is missing
src_national <- dd$Source[is.na(dd$County)]

# select unique counties
counties <- unique(dd$County[!is.na(dd$County)])

# create all combinations of national sources and counties
src_cnt <- expand.grid(Source = src_national, County = counties)

# add score from current data to national sources
src_cnt2 <- merge(src_cnt, dd[is.na(dd$County), c("Source", "Score")], by = "Source")

# add national sources to local sources in dd
dd2 <- rbind(dd[!is.na(dd$County), ], src_cnt2)

# order by Source and County
# assuming desired data is named `desired`
library(plyr)
desired2 <- arrange(df = desired, Source, County) 
dd2 <- arrange(df = dd2, Source, County)
all.equal(desired2, dd2)
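
The question mentions roughly 60 variables per record. As an alternative sketch (not the method above), a cross join with `merge(..., by = NULL)` carries every column of the national rows along without naming them. The data frame below is a hand-typed copy of the question's `dput` using character columns instead of factors, which is an assumption, and the names `is_national`, `local_rows`, `national_rows` and `national_all` are illustrative only.

```r
# Hand-typed copy of the question's current layout (character columns assumed)
dd <- data.frame(
  Source = c("Morning Call", "Daily Ledger", "Inquirer", "Herald Tribune",
             "Yahoo News", "Associated Press", "NPR"),
  County = c("Lehigh", "Mercer", "Philadelphia", "Montgomery", NA, NA, NA),
  Score = c(3L, 10L, 4L, 8L, 1L, 3L, 6L),
  stringsAsFactors = FALSE
)

# National sources are the rows without a county
is_national <- is.na(dd$County)
local_rows <- dd[!is_national, ]

# Drop the empty County column but keep every other variable
national_rows <- dd[is_national, names(dd) != "County", drop = FALSE]

# One-column data frame of the distinct counties
counties <- data.frame(County = unique(local_rows$County),
                       stringsAsFactors = FALSE)

# by = NULL yields the Cartesian product: each national row x each county
national_all <- merge(national_rows, counties, by = NULL)

# Restore the original column order, stack, and sort
dd2 <- rbind(local_rows, national_all[names(dd)])
dd2 <- dd2[order(dd2$Source, dd2$County), ]
rownames(dd2) <- NULL
```

With 3 national sources and 4 counties this gives 4 + 12 = 16 rows, matching the row count of the desired layout.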

For the last part of your question, you can either rbind the national-source counties in src_cnt onto your County.Name, or select the relevant variable from dd2.
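
One way the second option might look, as a minimal sketch: compute `County.Name` with the question's `switchfun`, then fill the "Null" entries (the national sources) from the `County` column of the expanded data. The three-row `dd2` here is a made-up stand-in for the real result, and `needs_fill` is an illustrative name.

```r
# switchfun as given in the question
switchfun <- function(x) {
  switch(x, 'Morning Call' = 'Lehigh', 'Inquirer' = 'Philadelphia',
         'Daily Ledger' = 'Mercer', 'Null')
}

# Made-up stand-in for the expanded data: one local row, two national rows
dd2 <- data.frame(
  Source = c("Morning Call", "NPR", "NPR"),
  County = c("Lehigh", "Lehigh", "Mercer"),
  stringsAsFactors = FALSE
)

# County names for local sources; national sources fall through to "Null"
County.Name <- vapply(dd2$Source, switchfun, character(1), USE.NAMES = FALSE)

# Fill the "Null" entries from the county assigned during expansion
needs_fill <- County.Name == "Null" & !is.na(dd2$County)
County.Name[needs_fill] <- dd2$County[needs_fill]

# Keep a single county field, as requested in the question
dd2$County.Name <- County.Name
dd2$County <- NULL
```

Note that `switch` expects a character string, so if `Source` is a factor it should be converted with `as.character()` before being passed to `switchfun`.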