Question

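I am trying to split my data frame (train) into two categories, metro and nonmetro, using the factor variable $area__rucc. The data frame is clean, with 34 variables and 2811 observations.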
```
 glimpse(train$area__rucc)
```
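Factor w/ 9 levels "Metro - Counties in metro areas of 1 million population or more",..: 3 3 1 6 7 8 6 2 7 5 ...

The first three levels represent metro and the last six represent nonmetro.

- First I tried to subset by metro...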

metro <- subset(train, area__rucc == c("Metro - Counties in metro areas of 1 million population or more", "Metro - Counties in metro areas of 250,000 to 1 million population", "Metro - Counties in metro areas of fewer than 250,000 population"))

This appeared to work as expected and returned a df with 387 observations.

- Next I tried to subset by the nonmetro levels like this...

not_metro <- subset(train, area__rucc != c("Metro - Counties in metro areas of 1 million population or more", "Metro - Counties in metro areas of 250,000 to 1 million population", "Metro - Counties in metro areas of fewer than 250,000 population"))

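This returned 2811 observations, but on further inspection the df contains both metro and nonmetro levels. Clearly it is not working as I intended.

- My third shot...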

non_metro <- subset(train, area__rucc == c("Nonmetro - Completely rural or less than 2,500 urban population, adjacent to a metro area", 
                "Nonmetro - Completely rural or less than 2,500 urban population, not adjacent to a metro area", 
                "Nonmetro - Urban population of 2,500 to 19,999, adjacent to a metro area", 
                "Nonmetro - Urban population of 2,500 to 19,999, not adjacent to a metro area", 
                "Nonmetro - Urban population of 20,000 or more, adjacent to a metro area", 
                "Nonmetro - Urban population of 20,000 or more, not adjacent to a metro area"))

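Here I explicitly listed the nonmetro levels (4:9). This returned a df with 354 observations, all of which are nonmetro.

387 (metro) + 354 (nonmetro) != 3189. There are no missing values in train$area__rucc, so the two dfs I'm trying to create from train should together have the same number of observations as the original df, right?

I have a feeling I'm making a silly mistake that I can't seem to catch right now (probably due to inexperience), or that what I'm trying to do here is completely off base, but this is starting to frustrate me and any insight would be greatly appreciated.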

Answer 1

I'm not sure what end result you are after, but I think something like this should be neat:

    library(dplyr)  # provides %>%, mutate() and group_by()
    train %>%
        mutate(metro = ifelse(area__rucc == "Metro - Counties in metro areas of 1 million population or more" |
                              area__rucc == "Metro - Counties in metro areas of 250,000 to 1 million population" |
                              area__rucc == "Metro - Counties in metro areas of fewer than 250,000 population", 1, 0)) %>%
        group_by(metro)
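
The same flag idea can be sketched in base R, using `%in%` for the membership test instead of chained `==` comparisons. This is only an illustration on a toy data frame: the labels `"M1"`, `"M2"`, `"N1"`, `"N2"` and the names `toy`, `metro_levels` and `by_group` are made up for the example and are not part of the asker's data.

```r
# Toy stand-in for `train`; the level labels are hypothetical.
toy <- data.frame(area = c("M1", "N1", "M2", "N2", "N1"),
                  stringsAsFactors = TRUE)
metro_levels <- c("M1", "M2")

# 1 for metro rows, 0 for everything else.
toy$metro <- as.integer(toy$area %in% metro_levels)

# Row counts per group.
table(toy$metro)

# One data frame per group, named "0" and "1".
by_group <- split(toy, toy$metro)
```

With a flag column like this, every row lands in exactly one group, so the group sizes add up to `nrow(toy)`.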

Answer 2

== does element-wise (row-by-row) comparison; you want to use %in%

Before getting to your code, let's do a simple example

x = 1:6
y = c(1, 3)
x == y
# [1]  TRUE FALSE FALSE FALSE FALSE FALSE

Notice that there is only one TRUE, even though both 1 and 3 are in 1:6. That is because the comparison happens like this:

data.frame(x, y, "x==y" = x == y, check.names = FALSE)
#   x y  x==y
# 1 1 1  TRUE   # 1 does equal 1
# 2 2 3 FALSE   # 2 does not equal 3
# 3 3 1 FALSE   # 3 does not equal 1
# 4 4 3 FALSE   # 4 does not equal 3
# 5 5 1 FALSE   # 5 does not equal 1
# 6 6 3 FALSE   # 6 does not equal 3
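
The pairing in the table above can also be reproduced by hand. The sketch below uses base R's `rep_len()` to stretch `y` to the length of `x`, which is what `==` does implicitly; the name `y_recycled` is introduced only for this illustration.

```r
x <- 1:6
y <- c(1, 3)

# Repeat y until it is as long as x: 1 3 1 3 1 3
y_recycled <- rep_len(y, length(x))
y_recycled

# Comparing against the hand-recycled vector gives the same result
# as letting `==` recycle y on its own.
identical(x == y, x == y_recycled)
```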

x == y compares the first element of x with the first element of y, the second element of x with the second element of y, and so on. When either x or y is shorter, it gets "recycled": as you can see in the data frame above, the input y = c(1, 3) becomes 1 3 1 3 1 3 in the data frame.

Instead, use %in%:

x %in% y
# [1]  TRUE FALSE  TRUE FALSE FALSE FALSE

x %in% y checks each element of x against all elements of y. Now we get two TRUE values, because 1 and 3 are both in c(1, 3).

You can negate this as ! x %in% y, so, applied to your problem:

metro <- subset(train, area__rucc %in% c(
  "Metro - Counties in metro areas of 1 million population or more",
  "Metro - Counties in metro areas of 250,000 to 1 million population",
  "Metro - Counties in metro areas of fewer than 250,000 population"
))

not_metro <- subset(train, !area__rucc %in% c(
  "Metro - Counties in metro areas of 1 million population or more",
  "Metro - Counties in metro areas of 250,000 to 1 million population",
  "Metro - Counties in metro areas of fewer than 250,000 population"
))
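
To see why the row counts in the question could not add up, here is a small sketch on a toy factor. The labels `"M1"`, `"M2"`, `"N1"` and the names `toy` and `metro_levels` are hypothetical stand-ins for `train$area__rucc` and its metro levels.

```r
# Toy stand-in for train; the level labels are hypothetical.
toy <- data.frame(area = factor(c("M1", "N1", "M2", "M1", "N1", "M2")))
metro_levels <- c("M1", "M2")

# `==` recycles metro_levels as M1 M2 M1 M2 M1 M2, so a row only counts
# when its value happens to line up with the recycled position.
sum(toy$area == metro_levels)   # misses some metro rows
sum(toy$area != metro_levels)   # includes some metro rows

# %in% tests membership, so the two subsets split the rows cleanly.
metro     <- subset(toy, area %in% metro_levels)
not_metro <- subset(toy, !area %in% metro_levels)
nrow(metro) + nrow(not_metro) == nrow(toy)
```

In the toy data four rows are metro, yet `==` matches only two of them, and `!=` keeps four rows of which two are actually metro. That is the same pattern as the mixed `not_metro` result described in the question.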


Answer 3

Without knowing your data frame in depth, I think the following toy example may help you.

alphab <- data.frame(letters = c("A","T", "U", "Z"))
alphab

consonants <- subset(alphab, letters %in% c("T", "Z"))
consonants

vowels <- subset(alphab, !(letters %in% c("T","Z")))
vowels
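
Since every metro label quoted in the question begins with "Metro" and every other label begins with "Nonmetro", a prefix test is another possible way to build the two subsets without typing out the full level names. This is a sketch under that assumption only; `toy`, `id` and `is_metro` are names made up for the example, and it uses three of the labels quoted in the question.

```r
# Toy data frame built from level labels quoted in the question.
toy <- data.frame(
  id = 1:4,
  area = factor(c(
    "Metro - Counties in metro areas of 1 million population or more",
    "Nonmetro - Urban population of 20,000 or more, adjacent to a metro area",
    "Metro - Counties in metro areas of fewer than 250,000 population",
    "Nonmetro - Urban population of 20,000 or more, adjacent to a metro area"
  ))
)

# startsWith() needs character input, so convert the factor first.
# The match is case-sensitive, so "Nonmetro" labels do not match "Metro".
is_metro <- startsWith(as.character(toy$area), "Metro")

metro     <- toy[is_metro, ]
non_metro <- toy[!is_metro, ]
```

Because `is_metro` holds one TRUE or FALSE per row, `metro` and `non_metro` together always contain every row exactly once.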

Having trouble subsetting a data frame using a factor variable in R
