Question

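I am working with a data.frame ("hi") that outlines health insurance plans across U.S. counties. For each plan there is one row, with State and County columns plus information about the plan itself (premium, deductible, etc.).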

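As part of my analysis, I want to link this data.frame ("hi") to another data.frame (call it "census") that contains demographic information for each County. I had planned to match() the two tables using the County names they share and the FIPS ID (a federal geographic identifier).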

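Before I move on to that second step (the match()), I need to check for "common" county names across states: it turns out that Iowa and North Dakota (and Nebraska as well) each have a Sioux County. If I can't find a way around this, I may match() FIPS IDs and "census" information to the wrong county.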

sioux <- hi[hi$County == "Sioux",]
sioux[26:31,1:3]
      State County  Metal.Level
15407    IA  Sioux     Platinum
15408    IA  Sioux Catastrophic
15409    IA  Sioux       Silver
46129    ND  Sioux       Silver
46130    ND  Sioux       Silver
46131    ND  Sioux         Gold

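It seems like unique() would work well, but since County and State are in separate columns, I'm not sure how to specify that I'm looking for counties with the same name in different states.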

Answer 1

In addition to the suggestions in the comments, the following code produces a data frame of unique State, County pairs.

library(dplyr)
sioux %>% distinct(State, County)
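For anyone not using dplyr, base R's unique() on just the two columns gives the same kind of result, which is what the question was reaching for. A minimal sketch, using a small stand-in data frame built from the six rows printed in the question (the name sioux_sample is made up for illustration):

```r
# Stand-in for the six rows of `sioux` shown in the question
sioux_sample <- data.frame(
  State = c("IA", "IA", "IA", "ND", "ND", "ND"),
  County = rep("Sioux", 6),
  Metal.Level = c("Platinum", "Catastrophic", "Silver",
                  "Silver", "Silver", "Gold"),
  stringsAsFactors = FALSE
)

# unique() on a data frame drops duplicated rows, so restricting it to the
# State and County columns leaves one row per State/County pair
pairs <- unique(sioux_sample[, c("State", "County")])
pairs
```

Because State is part of each row being compared, Sioux in IA and Sioux in ND are kept as separate pairs.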

If you want this for all counties rather than just one, you can do the following.

#creates a data frame with two county names "Sioux" and "Countyx"
counties <- structure(list(State = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L
), .Label = c("IA", "ND"), class = "factor"), County = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 2L), .Label = c("  Sioux", " Countyx"), class = "factor"), 
    Metal.Level = structure(c(4L, 5L, 2L, 2L, 2L, 1L, 3L), .Label = c("         Gold", 
    "       Silver", "      Silver", "     Platinum", " Catastrophic"
    ), class = "factor")), .Names = c("State", "County", "Metal.Level"
), class = "data.frame", row.names = c(NA, -7L))


#Find the distinct State/County pairs, then keep only the County names that appear in more than one state.

counties %>% distinct(State, County) %>% group_by(County) %>% 
  filter(n()>1)
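Once the ambiguous names are known, one way to avoid mismatching them in the later step is to join on State and County together rather than on County alone. The sketch below uses base merge(). The hi_sample and census_sample data frames, the FIPS column name, and the placeholder ID strings are all hypothetical, since the question does not show the census data.

```r
# Hypothetical plan data: one row per plan
hi_sample <- data.frame(
  State = c("IA", "IA", "ND"),
  County = c("Sioux", "Sioux", "Sioux"),
  Metal.Level = c("Platinum", "Silver", "Gold"),
  stringsAsFactors = FALSE
)

# Hypothetical census data: one row per county.
# The FIPS values are placeholders, not real identifiers.
census_sample <- data.frame(
  State = c("IA", "ND"),
  County = c("Sioux", "Sioux"),
  FIPS = c("fips-IA-Sioux", "fips-ND-Sioux"),
  stringsAsFactors = FALSE
)

# Joining on both columns keeps same-named counties in different states apart;
# all.x = TRUE keeps every plan row even if no census row matches
joined <- merge(hi_sample, census_sample,
                by = c("State", "County"), all.x = TRUE)
joined
```

With dplyr, left_join(hi_sample, census_sample, by = c("State", "County")) should behave similarly.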
