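I'm trying to split my data frame (train) into two categories, metro and non-metro, by the factor variable $area__rucc. The data frame is clean, with 34 variables and 2811 observations.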
glimpse(train$area__rucc)
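Factor w/ 9 levels "Metro - Counties in metro areas of 1 million population or more",..: 3 3 1 6 7 8 6 2 7 5 ...
The first three levels represent metro and the last six represent non-metro.
- First I tried subsetting by metro...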
metro <- subset(train, area__rucc == c("Metro - Counties in metro areas of 1 million population or more", "Metro - Counties in metro areas of 250,000 to 1 million population", "Metro - Counties in metro areas of fewer than 250,000 population"))
This seemed to work as expected and returned a df with 387 observations.
- Next, I tried subsetting by the non-metro levels like this...
not_metro <- subset(train, area__rucc != c("Metro - Counties in metro areas of 1 million population or more", "Metro - Counties in metro areas of 250,000 to 1 million population", "Metro - Counties in metro areas of fewer than 250,000 population"))
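This returned 2811 observations, but on closer inspection the df contains both metro and non-metro levels. Clearly it is not working the way I expected.
- My third shot...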
non_metro <- subset(train, area__rucc == c("Nonmetro - Completely rural or less than 2,500 urban population, adjacent to a metro area",
"Nonmetro - Completely rural or less than 2,500 urban population, not adjacent to a metro area",
"Nonmetro - Urban population of 2,500 to 19,999, adjacent to a metro area",
"Nonmetro - Urban population of 2,500 to 19,999, not adjacent to a metro area",
"Nonmetro - Urban population of 20,000 or more, adjacent to a metro area",
"Nonmetro - Urban population of 20,000 or more, not adjacent to a metro area"))
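Here I explicitly listed the non-metro levels (4:9). This returned a df with 354 observations, none of which are metro.
387 (metro) + 354 (non-metro) != 3189. There are no missing values in train$area__rucc, so the two dfs I'm trying to create from train should together have the same number of observations as the original df, right?
I have a feeling I'm making a silly mistake that I just can't spot right now (probably from inexperience), or that what I'm attempting here is completely off base, but it's starting to frustrate me. Any insight would be greatly appreciated.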
Answer 0 (score: 0)
I'm not sure what end result you're after, but I think something like this should be tidy:
library(dplyr)
train %>%
  mutate(metro = ifelse(area__rucc == "Metro - Counties in metro areas of 1 million population or more" | area__rucc == "Metro - Counties in metro areas of 250,000 to 1 million population" | area__rucc == "Metro - Counties in metro areas of fewer than 250,000 population", 1, 0)) %>%
  group_by(metro)
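The flag-column idea above does not need dplyr. As a rough base-R sketch, assuming a toy data frame with made-up, shortened level names (toy, metro_levels and the labels are all hypothetical stand-ins for train and its real levels):

```r
# Toy stand-in for `train`; the labels are hypothetical, not the real RUCC levels
toy <- data.frame(
  area = factor(c("Metro - large", "Nonmetro - rural", "Metro - small", "Nonmetro - rural"))
)
metro_levels <- c("Metro - large", "Metro - small")

# 1 for metro rows, 0 otherwise; %in% checks each row against every label
toy$metro <- as.integer(toy$area %in% metro_levels)

# Tabulate the flag to see how many rows fall in each group
counts <- table(toy$metro)
counts
```

Keeping everything in one data frame with a flag makes it easy to confirm that no rows went missing, since the counts in the table have to add up to nrow(toy).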
Answer 1 (score: 0)
== does element-wise (row-by-row) comparison; you want to use %in%.
Before getting to your code, let's do a simple example:
x = 1:6
y = c(1, 3)
x == y
# [1] TRUE FALSE FALSE FALSE FALSE FALSE
Notice that there is only one TRUE, even though both 1 and 3 are in 1:6. That's because the comparison happens like this:
data.frame(x, y, "x==y" = x == y, check.names = FALSE)
#   x y  x==y
# 1 1 1  TRUE   # 1 does equal 1
# 2 2 3 FALSE   # 2 does not equal 3
# 3 3 1 FALSE   # 3 does not equal 1
# 4 4 3 FALSE   # 4 does not equal 3
# 5 5 1 FALSE   # 5 does not equal 1
# 6 6 3 FALSE   # 6 does not equal 3
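x == y compares the first element of x with the first element of y, checks the second element of x against the second element of y, and so on. When x or y is shorter, it is "recycled": as you can see in the data frame above, the input y = c(1, 3) becomes 1 3 1 3 1 3 in the data frame.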
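To make the recycling concrete, here is a small sketch using made-up vectors (recycled, z and msg are names introduced only for this illustration). It shows that == recycles silently when the longer length is a multiple of the shorter one and only warns when it is not, which may be why comparing 2811 rows against 3 labels produced no warning.

```r
x <- 1:6
y <- c(1, 3)

# Lengths 6 and 2: y is stretched to 1 3 1 3 1 3, with no warning
recycled <- rep(y, length.out = length(x))
identical(x == y, x == recycled)

# Lengths 6 and 4: 6 is not a multiple of 4, so R issues a warning;
# tryCatch() captures its message instead of printing it
z <- c(1, 3, 5, 7)
msg <- tryCatch(x == z, warning = function(w) conditionMessage(w))
msg
```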
Instead, use %in%:
x %in% y
# [1]  TRUE FALSE  TRUE FALSE FALSE FALSE
x %in% y compares each element of x with all elements of y. Now we get two TRUE values, because both 1 and 3 are in c(1, 3).
Applied to your problem:
metro <- subset(train,
  area__rucc %in% c(
    "Metro - Counties in metro areas of 1 million population or more",
    "Metro - Counties in metro areas of 250,000 to 1 million population",
    "Metro - Counties in metro areas of fewer than 250,000 population"
  )
)
You can negate it as ! x %in% y, so
not_metro <- subset(train,
  !area__rucc %in% c(
    "Metro - Counties in metro areas of 1 million population or more",
    "Metro - Counties in metro areas of 250,000 to 1 million population",
    "Metro - Counties in metro areas of fewer than 250,000 population"
  )
)
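A quick way to check a split like this is to confirm that the two pieces add up to the original row count. The sketch below uses a small toy data frame with hypothetical, shortened labels (toy, metro_levels and the level names are assumptions, not the real data) and also reproduces the == mistake from the question for comparison.

```r
# Toy stand-in for `train`; labels are made up for illustration
toy <- data.frame(
  area = factor(c("Metro B", "Metro A", "Nonmetro A", "Metro A", "Nonmetro B", "Metro C"))
)
metro_levels <- c("Metro A", "Metro B", "Metro C")

metro     <- subset(toy, area %in% metro_levels)
not_metro <- subset(toy, !area %in% metro_levels)

# Complementary subsets: their row counts sum to the total
nrow(metro) + nrow(not_metro) == nrow(toy)

# The mistake from the question: == lines the three labels up against the rows
# one by one (recycling them), so only rows that happen to match by position survive
wrong <- subset(toy, area == metro_levels)
nrow(wrong)   # 2 here, although 4 rows are metro
```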
Answer 2 (score: 0)
Without knowing your data frame in detail, I think the following toy example might help you.
alphab <- data.frame(letters = c("A","T", "U", "Z"))
alphab
consonants <- subset(alphab, letters %in% c("T", "Z"))
consonants
vowels <- subset(alphab, !(letters %in% c("T","Z")))
vowels
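Since the question says the first three factor levels are the metro ones, the labels could also be picked by position rather than typed out. This is only a sketch on toy data with hypothetical labels (toy, metro_levels, kept and the level names are assumptions). It also shows that subset() keeps unused factor levels unless droplevels() is applied afterwards.

```r
# Toy factor whose first three levels play the role of the metro levels
toy <- data.frame(
  id   = 1:5,
  area = factor(
    c("Nonmetro - rural", "Metro - large", "Metro - small", "Nonmetro - rural", "Metro - medium"),
    levels = c("Metro - large", "Metro - medium", "Metro - small", "Nonmetro - rural")
  )
)

# Take the metro labels by position in levels() instead of retyping them
metro_levels <- levels(toy$area)[1:3]

# subset() alone leaves the now-empty "Nonmetro - rural" level in the factor
kept  <- subset(toy, area %in% metro_levels)
nlevels(kept$area)

# droplevels() removes levels that no longer occur in the subset
metro <- droplevels(kept)
nlevels(metro$area)
```

Selecting by position only makes sense when the level order is known and stable, so it is worth printing levels() once to confirm it before relying on this.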