Converting a character vector to a factor with a supplied list of levels creates NAs

Asked: 2017-05-15 18:35:40

Tags: r

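I'm sure this has been asked on SO before, but I haven't found a question that addresses this specific problem.

I have the following character vector: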

chr_string <- c("NEW.ENGLAND", "NEW.ENGLAND", "NEW.ENGLAND", "NEW.ENGLAND", "NEW.ENGLAND", "NEW.ENGLAND",
                "NEW.ENGLAND", "NEW.ENGLAND", "MIDDLE.ATLANTIC", "MIDDLE.ATLANTIC", "MIDDLE.ATLANTIC", "MIDDLE.ATLANTIC",
                "MIDDLE.ATLANTIC", "MIDDLE.ATLANTIC", "MIDDLE.ATLANTIC", "E..NOR..CENTRAL", "E..NOR..CENTRAL", "E..NOR..CENTRAL",
                "E..NOR..CENTRAL", "E..NOR..CENTRAL", "E..NOR..CENTRAL", "W..NOR..CENTRAL", "W..NOR..CENTRAL", "W..NOR..CENTRAL",
                "W..NOR..CENTRAL", "W..NOR..CENTRAL", "SOUTH.ATLANTIC",  "SOUTH.ATLANTIC",  "SOUTH.ATLANTIC", "SOUTH.ATLANTIC", 
                "E..SOU..CENTRAL", "E..SOU..CENTRAL", "E..SOU..CENTRAL", "W..SOU..CENTRAL", "W..SOU..CENTRAL", "MOUNTAIN")

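I want to convert it to a factor with a specified set of levels, such as the one below (note that not every level in the levels vector below appears in the chr_string vector above):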

levels <- c("NEW ENGLAND", "MIDDLE ATLANTIC", "E. NOR. CENTRAL", "W. NOR. CENTRAL", "SOUTH ATLANTIC", "E. SOU. CENTRAL",
            "W. SOU. CENTRAL", "MOUNTAIN", "PACIFIC") 

Unfortunately, when I try the following, almost every element of my vector becomes NA:

factor(chr_string, levels = levels)
 [1] <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>    
 [13] <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>    
 [25] <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     <NA>     MOUNTAIN
 9 Levels: NEW ENGLAND MIDDLE ATLANTIC E. NOR. CENTRAL W. NOR. CENTRAL SOUTH ATLANTIC ... PACIFIC

I understand that the NAs are created for the following reason (from ?factor):

  

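The encoding of the vector happens as follows. First all the values in exclude are removed from levels. If x[i] equals levels[j], then the i-th element of the result is j. If no match is found for x[i] in levels (which will happen for excluded values) then the i-th element of the result is set to NA.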
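The rule quoted above can be checked directly. The following is only a minimal sketch on a shortened, hypothetical sample (the names x and lvls are not from the question): it compares the integer codes stored by factor() with the positions returned by match().

```r
# Shortened, hypothetical sample used only to illustrate the quoted rule
x    <- c("NEW.ENGLAND", "MOUNTAIN", "PACIFIC")
lvls <- c("NEW ENGLAND", "MOUNTAIN", "PACIFIC")

# match() returns the position of each x[i] in lvls, or NA if there is none
codes <- match(x, lvls)

# factor() stores the same integer codes internally
f <- factor(x, levels = lvls)

codes          # NA 2 3
as.integer(f)  # NA 2 3
```

"NEW.ENGLAND" has no exact counterpart in lvls, so its code is NA, while the two strings that match exactly keep their positions.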

But how can I avoid this?

2 answers:

Answer 0: (score: 1)

As Greg said, the problem is that your strings don't match levels. They need to match exactly. To make them match, you can do the following:

#starting with user specific data and levels
chr_string <- c("NEW.ENGLAND", "NEW.ENGLAND", "NEW.ENGLAND", "NEW.ENGLAND", "NEW.ENGLAND", "NEW.ENGLAND",
                "NEW.ENGLAND", "NEW.ENGLAND", "MIDDLE.ATLANTIC", "MIDDLE.ATLANTIC", "MIDDLE.ATLANTIC", "MIDDLE.ATLANTIC",
                "MIDDLE.ATLANTIC", "MIDDLE.ATLANTIC", "MIDDLE.ATLANTIC", "E..NOR..CENTRAL", "E..NOR..CENTRAL", "E..NOR..CENTRAL",
                "E..NOR..CENTRAL", "E..NOR..CENTRAL", "E..NOR..CENTRAL", "W..NOR..CENTRAL", "W..NOR..CENTRAL", "W..NOR..CENTRAL",
                "W..NOR..CENTRAL", "W..NOR..CENTRAL", "SOUTH.ATLANTIC",  "SOUTH.ATLANTIC",  "SOUTH.ATLANTIC", "SOUTH.ATLANTIC", 
                "E..SOU..CENTRAL", "E..SOU..CENTRAL", "E..SOU..CENTRAL", "W..SOU..CENTRAL", "W..SOU..CENTRAL", "MOUNTAIN")
levels <- c("NEW ENGLAND", "MIDDLE ATLANTIC", "E. NOR. CENTRAL", "W. NOR. CENTRAL", "SOUTH ATLANTIC", "E. SOU. CENTRAL",
            "W. SOU. CENTRAL", "MOUNTAIN", "PACIFIC") 

#replace each period in your vector of strings with a space
#(gsub is vectorised, so no sapply is needed and no names are added)
chr_string <- gsub('[.]', ' ', chr_string)

#replace double spaces with the '. ' string required by levels
chr_string <- gsub('  ', '. ', chr_string)

#as requested; expected result
factor(chr_string, levels = levels)

Alternatively, just change the levels.
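One way that alternative could look is sketched below. It assumes the dotted strings differ from the levels only in having a period wherever the levels have a space; dotted_levels is a hypothetical helper name, and chr_string is shortened here for brevity.

```r
# Shortened sample of the question's data
chr_string <- c("NEW.ENGLAND", "E..NOR..CENTRAL", "MOUNTAIN")
levels <- c("NEW ENGLAND", "MIDDLE ATLANTIC", "E. NOR. CENTRAL", "W. NOR. CENTRAL",
            "SOUTH ATLANTIC", "E. SOU. CENTRAL", "W. SOU. CENTRAL", "MOUNTAIN", "PACIFIC")

# Rewrite the levels into the dotted form used by the data:
# every space becomes a period, e.g. "E. NOR. CENTRAL" -> "E..NOR..CENTRAL"
dotted_levels <- gsub(" ", ".", levels, fixed = TRUE)

# Every string now has an exact match, so no NAs are produced
f <- factor(chr_string, levels = dotted_levels)
f
```

The resulting factor keeps all nine levels in the requested order, but its labels are in the dotted form.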

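Answer 1: (score: 0)

The levels you specify need to match the strings. Your first element is "NEW.ENGLAND", but in the levels you have "NEW ENGLAND" (with a space instead of a dot), so R will not match them. Creating the factor requires exact matches; you can then use the labels argument to change the level labels after matching, or you can use a second step and call levels to change the labels.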
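Both options might look roughly like the sketch below. It is an illustration under assumptions, not necessarily the code the answerer had in mind: spaced_levels is the question's levels vector under a different name, dotted_levels is a hypothetical helper, and chr_string is shortened.

```r
# Shortened sample of the question's data
chr_string <- c("NEW.ENGLAND", "E..NOR..CENTRAL", "MOUNTAIN")
spaced_levels <- c("NEW ENGLAND", "MIDDLE ATLANTIC", "E. NOR. CENTRAL", "W. NOR. CENTRAL",
                   "SOUTH ATLANTIC", "E. SOU. CENTRAL", "W. SOU. CENTRAL", "MOUNTAIN", "PACIFIC")

# Levels in the dotted form that matches the data exactly
dotted_levels <- gsub(" ", ".", spaced_levels, fixed = TRUE)

# Option 1: match on the dotted levels, relabel in the same call
f1 <- factor(chr_string, levels = dotted_levels, labels = spaced_levels)

# Option 2: build the factor first, then replace its levels in a second step
f2 <- factor(chr_string, levels = dotted_levels)
levels(f2) <- spaced_levels

f1
f2
```

Either way the matching is done against the dotted strings, and the spaced names appear only as labels on the result.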