I imagine this has a very simple answer, but here goes.
I have data in long format, like this:
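# Pass the columns to data.frame() directly; wrapping them in cbind()
# would coerce every column (including numbers and year) to character.
d <- data.frame(numbers = rnorm(10),
                year = rep(c(2008, 2009), 5),
                name = c("john", "David", "Tom", "Kristin", "Lisa", "Eve", "David", "Tom", "Kristin", "Lisa"))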
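How can I get a new data frame containing only the rows whose name appears in both 2008 and 2009 (i.e. only David, Kristin, Lisa and Tom)?
Thanks in advance.
Answer 0 (score: 11)
The simple way: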
subset(
d,
name %in% intersect(name[year==2008], name[year==2009])
)
Answer 1 (score: 3)
One way is to use the reshape package to create a data.frame with the years in the columns and the names in the rows:
library(reshape)
cast(d, name ~ year, value = "numbers")
You can then use complete.cases to extract the rows of interest, i.e. the names that have a value in every year column.
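The complete.cases step is not spelled out above, so here is a minimal sketch of how it might look. It assumes a substitution: base R's reshape() is used instead of the reshape package's cast(), so that the example runs without extra packages. The names wide, keep and both_years are hypothetical, and the approach assumes the numbers column itself contains no NA values.

```r
# Example data in long format (columns built directly, not via cbind)
d <- data.frame(numbers = rnorm(10),
                year = rep(c(2008, 2009), 5),
                name = c("john", "David", "Tom", "Kristin", "Lisa",
                         "Eve", "David", "Tom", "Kristin", "Lisa"),
                stringsAsFactors = FALSE)

# One row per name, one numbers.<year> column per year;
# a name that is missing a year gets NA in that column
wide <- reshape(d, idvar = "name", timevar = "year", direction = "wide")

# Names with a value in every year column
keep <- wide$name[complete.cases(wide)]

# Rows of the original long data for those names only
both_years <- d[d$name %in% keep, ]
```

Because complete.cases treats any NA as a missing year, this only works as intended when the value column has no missing values of its own.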
Answer 2 (score: 2)
If there is only one record per person per year, simply count how many times each person appears in the dataset:
counts <- as.data.frame(table(name = d$name))
Then look for the people who appear twice:
subset(counts, Freq == 2)
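The counts table only lists the names, so one more step is needed to get back to the rows of d. A minimal sketch, under the same assumption of exactly one record per person per year (the names keep and both_years are hypothetical):

```r
# Example data in long format (columns built directly, not via cbind)
d <- data.frame(numbers = rnorm(10),
                year = rep(c(2008, 2009), 5),
                name = c("john", "David", "Tom", "Kristin", "Lisa",
                         "Eve", "David", "Tom", "Kristin", "Lisa"),
                stringsAsFactors = FALSE)

# Number of rows per name
counts <- as.data.frame(table(name = d$name))

# Names that occur twice, i.e. once in each of the two years
keep <- as.character(counts$name[counts$Freq == 2])

# Rows of the original data for those names only
both_years <- d[d$name %in% keep, ]
```

If a person could have two records in the same year, this count would wrongly include them, which is the case the intersect-based answers handle.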
Answer 3 (score: 1)
Here is another solution that uses only base R and makes no assumption about the number of records per person per year:
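# Pass the columns to data.frame() directly; wrapping them in cbind()
# would coerce every column (including numbers and year) to character.
d <- data.frame(numbers = rnorm(10),
                year = rep(c(2008, 2009), 5),
                name = c("john", "David", "Tom", "Kristin",
                         "Lisa", "Eve", "David", "Tom", "Kristin",
                         "Lisa"))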
# split data into 2 data.frames (1 for each year)
by.year <- split(d, d$year, drop = TRUE)
# find the names that appear in both years
keep <- intersect(by.year[['2008']]$name, by.year[['2009']]$name)
# Or, if you had several years, use Reduce as a more general solution:
keep <- Reduce(intersect, lapply(by.year, '[[', 'name'))
# show the rows of the original dataset only if their $name field
# is in our 'keep' vector
d[d$name %in% keep,]