Trouble subsetting a data frame by a factor variable in R

Date: 2018-07-16 19:56:14

Tags: r

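  • I'm trying to split my data frame (train) into two categories, metro and non-metro, by the factor variable $area__rucc. The data frame is clean, with 34 variables and 2811 observations.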

     glimpse(train$area__rucc)
    

     Factor w/ 9 levels "Metro - Counties in metro areas of 1 million population or more",..: 3 3 1 6 7 8 6 2 7 5 ...

The first three levels are metro and the last six are non-metro.

- First I tried subsetting by the metro levels...

metro <- subset(train, area__rucc == c("Metro - Counties in metro areas of 1 million population or more", "Metro - Counties in metro areas of 250,000 to 1 million population", "Metro - Counties in metro areas of fewer than 250,000 population"))

This seemed to work as expected and returned a df with 387 observations.

- Next I tried subsetting by the non-metro levels like this...

not_metro <- subset(train, area__rucc != c("Metro - Counties in metro areas of 1 million population or more", "Metro - Counties in metro areas of 250,000 to 1 million population", "Metro - Counties in metro areas of fewer than 250,000 population"))

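This returned 2811 observations, but on further inspection the df contains both metro and non-metro levels. Clearly it is not working as I expected.

- My third attempt...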

non_metro <- subset(train, area__rucc == c("Nonmetro - Completely rural or less than 2,500 urban population, adjacent to a metro area", 
                "Nonmetro - Completely rural or less than 2,500 urban population, not adjacent to a metro area", 
                "Nonmetro - Urban population of 2,500 to 19,999, adjacent to a metro area", 
                "Nonmetro - Urban population of 2,500 to 19,999, not adjacent to a metro area", 
                "Nonmetro - Urban population of 20,000 or more, adjacent to a metro area", 
                "Nonmetro - Urban population of 20,000 or more, not adjacent to a metro area"))

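Here I explicitly listed the non-metro levels (4:9). This returned a df with 354 observations, all of them non-metro.

387 (metro) + 354 (non-metro) != 3189. There are no missing values in train$area__rucc, so the two dfs I'm trying to create from train should together have the same number of observations as the original df, right?

I have a feeling I'm making a silly mistake that I just can't seem to catch right now (probably through inexperience), or that what I'm trying to do here is completely off base, but this is starting to frustrate me and any insight would be much appreciated.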

3 answers:

Answer 0: (score: 0)

I'm not sure what end result you are after, but I think something like this should be pretty tidy:

    library(dplyr)  # provides %>%, mutate() and group_by()
    train %>%
        mutate(metro = ifelse(area__rucc == "Metro - Counties in metro areas of 1 million population or more" |
                              area__rucc == "Metro - Counties in metro areas of 250,000 to 1 million population" |
                              area__rucc == "Metro - Counties in metro areas of fewer than 250,000 population", 1, 0)) %>%
        group_by(metro)
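The same 0/1 flag can also be built in base R without dplyr. The sketch below is only an illustration: `toy` and its shortened level names are hypothetical stand-ins for `train` and the real RUCC labels, and it relies on the question's statement that the first three factor levels are the metro ones.

```r
# Toy stand-in for `train`; the level names are shortened, hypothetical labels.
toy <- data.frame(
  area = factor(
    c("Metro C", "Nonmetro X", "Metro A", "Nonmetro Y", "Metro B"),
    levels = c("Metro A", "Metro B", "Metro C", "Nonmetro X", "Nonmetro Y")
  )
)

# Assumption (taken from the question): the first three levels are metro.
metro_levels <- levels(toy$area)[1:3]

# 1 for metro rows, 0 otherwise; %in% tests membership, so nothing is recycled.
toy$metro <- as.integer(toy$area %in% metro_levels)
table(toy$metro)
```

Picking the levels by position avoids retyping the long labels, but it only works if the level order really is metro first, so it is worth checking `levels()` on the real column before relying on it.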

Answer 1: (score: 0)

== does an element-wise (row-by-row) comparison. You want to use %in%.

Before getting to your code, let's do a simple example

x = 1:6
y = c(1, 3)
x == y
# [1]  TRUE FALSE FALSE FALSE FALSE FALSE

Notice that there is only one TRUE, even though both 1 and 3 are in 1:6. That's because the comparison happens like this:

data.frame(x, y, "x==y" = x == y, check.names = FALSE)
#   x y  x==y
# 1 1 1  TRUE   # 1 does equal 1
# 2 2 3 FALSE   # 2 does not equal 3
# 3 3 1 FALSE   # 3 does not equal 1
# 4 4 3 FALSE   # 4 does not equal 3
# 5 5 1 FALSE   # 5 does not equal 1
# 6 6 3 FALSE   # 6 does not equal 3

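x == y compares the first element of x with the first element of y, checks the second element of x against the second element of y, and so on. When y is shorter than x, it is "recycled", as you can see in the data frame above, where the input y = c(1, 3) becomes 1 3 1 3 1 3.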
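The recycling can be made explicit with base R's `rep_len()`, and the same rule explains why a `!=` comparison against a short vector keeps almost everything. This is a small sketch on the same `x` and `y` as above, not the asker's data:

```r
x <- 1:6
y <- c(1, 3)

# Recycling: y is conceptually stretched to the length of x before comparing.
y_recycled <- rep_len(y, length(x))
y_recycled

# Comparing against the stretched vector gives exactly the same result as x == y.
identical(x == y, x == y_recycled)

# != is recycled the same way, so every position that does not happen to line up
# with its recycled partner comes out TRUE, including the 3 that is in y.
x != y
sum(x != y)
```

This is the pattern behind a `!=` subset that returns most of the rows, with wanted and unwanted values mixed together.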

Instead, use %in%:

x %in% y
# [1]  TRUE FALSE  TRUE FALSE FALSE FALSE

x %in% y compares each element of x with all of the elements of y. Now we get two TRUE values, because the 1 and the 3 in x both appear in c(1, 3).

Applied to your problem:

metro <- subset(train, area__rucc %in% c(
  "Metro - Counties in metro areas of 1 million population or more",
  "Metro - Counties in metro areas of 250,000 to 1 million population",
  "Metro - Counties in metro areas of fewer than 250,000 population"
))

You can negate it as ! x %in% y, so

not_metro <- subset(train, !area__rucc %in% c(
  "Metro - Counties in metro areas of 1 million population or more",
  "Metro - Counties in metro areas of 250,000 to 1 million population",
  "Metro - Counties in metro areas of fewer than 250,000 population"
))
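A quick sanity check for this kind of split is that the two subsets add up to the original row count. The sketch below uses a hypothetical six-row data frame `toy` with shortened, made-up level names in place of `train`:

```r
# Toy stand-in for `train`; the level names are shortened, hypothetical labels.
toy <- data.frame(
  area = factor(c("Metro A", "Nonmetro X", "Metro B",
                  "Nonmetro Y", "Metro C", "Metro A"))
)
metro_levels <- c("Metro A", "Metro B", "Metro C")

# Membership tests: every row lands in exactly one of the two subsets.
metro     <- subset(toy, area %in% metro_levels)
not_metro <- subset(toy, !area %in% metro_levels)
nrow(metro) + nrow(not_metro) == nrow(toy)

# For contrast: == recycles metro_levels along the column and keeps only the
# rows that happen to line up with their recycled partner.
by_equals <- subset(toy, area == metro_levels)
nrow(by_equals)
```

With the membership version the counts always partition the data (as long as the column has no missing values), whereas the `==` version silently drops metro rows whose position does not match.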

Answer 2: (score: 0)

Without knowing more about your data frame, I think the following toy example might help you.

alphab <- data.frame(letters = c("A","T", "U", "Z"))
alphab

consonants <- subset(alphab, letters %in% c("T", "Z"))
consonants

vowels <- subset(alphab, !(letters %in% c("T","Z")))
vowels
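If both halves are needed at once, base R's `split()` can produce them in one call from the same membership test. This is a sketch on the toy `alphab` data above; the group labels "consonants" and "vowels" are just names chosen here for illustration:

```r
alphab <- data.frame(letters = c("A", "T", "U", "Z"))

# One logical membership test drives both groups.
is_consonant <- alphab$letters %in% c("T", "Z")

# split() returns a named list of data frames, one per group label.
parts <- split(alphab, ifelse(is_consonant, "consonants", "vowels"))
parts$consonants
parts$vowels
```

Because both pieces come from a single logical vector, their row counts necessarily add up to `nrow(alphab)`.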