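I'm sure this has been asked on SO before, but I haven't found a question that addresses this problem specifically.
I have the following character vector: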
chr_string <- c("NEW.ENGLAND", "NEW.ENGLAND", "NEW.ENGLAND", "NEW.ENGLAND", "NEW.ENGLAND", "NEW.ENGLAND",
"NEW.ENGLAND", "NEW.ENGLAND", "MIDDLE.ATLANTIC", "MIDDLE.ATLANTIC", "MIDDLE.ATLANTIC", "MIDDLE.ATLANTIC",
"MIDDLE.ATLANTIC", "MIDDLE.ATLANTIC", "MIDDLE.ATLANTIC", "E..NOR..CENTRAL", "E..NOR..CENTRAL", "E..NOR..CENTRAL",
"E..NOR..CENTRAL", "E..NOR..CENTRAL", "E..NOR..CENTRAL", "W..NOR..CENTRAL", "W..NOR..CENTRAL", "W..NOR..CENTRAL",
"W..NOR..CENTRAL", "W..NOR..CENTRAL", "SOUTH.ATLANTIC", "SOUTH.ATLANTIC", "SOUTH.ATLANTIC", "SOUTH.ATLANTIC",
"E..SOU..CENTRAL", "E..SOU..CENTRAL", "E..SOU..CENTRAL", "W..SOU..CENTRAL", "W..SOU..CENTRAL", "MOUNTAIN")
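I want to convert it to a factor with a specified list of levels, such as the one below (note that not all of the levels in the levels vector below appear in the chr_string vector above):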
levels <- c("NEW ENGLAND", "MIDDLE ATLANTIC", "E. NOR. CENTRAL", "W. NOR. CENTRAL", "SOUTH ATLANTIC", "E. SOU. CENTRAL",
"W. SOU. CENTRAL", "MOUNTAIN", "PACIFIC")
Unfortunately, when I try the following, most of my vector turns into NA:
factor(chr_string, levels = levels)
[1] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
[13] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
[25] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> MOUNTAIN
9 Levels: NEW ENGLAND MIDDLE ATLANTIC E. NOR. CENTRAL W. NOR. CENTRAL SOUTH ATLANTIC ... PACIFIC
I understand that the NAs are created for the following reason (from ?factor):
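The encoding of the vector happens as follows. First all the values in exclude are removed from levels. If x[i] equals levels[j], then the i-th element of the result is j. If no match is found for x[i] in levels (which will happen for excluded values) then the i-th element of the result is set to NA.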
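As far as I can tell, the rule quoted above behaves like match(). Here is a minimal sketch on a made-up three-element vector (x and lv are hypothetical names, not the data above):

```r
# Small illustrative vectors (hypothetical, not the full data above)
x  <- c("NEW.ENGLAND", "MOUNTAIN", "PACIFIC")
lv <- c("NEW ENGLAND", "MOUNTAIN", "PACIFIC")

# For each x[i], match() returns the index j with x[i] == lv[j], or NA if none
codes <- match(x, lv)
print(codes)

# factor() stores the same integer codes, so the dotted string ends up as NA
f <- factor(x, levels = lv)
print(as.integer(f))
```

Only an exact string match produces a code; "NEW.ENGLAND" and "NEW ENGLAND" differ by one character, so the first element is NA.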
But how do I avoid this?
Answer 0 (score: 1)
As Greg said, the problem is that your strings do not match levels. They need to match exactly. To make that work, you can do the following:
#starting with user specific data and levels
chr_string <- c("NEW.ENGLAND", "NEW.ENGLAND", "NEW.ENGLAND", "NEW.ENGLAND", "NEW.ENGLAND", "NEW.ENGLAND",
"NEW.ENGLAND", "NEW.ENGLAND", "MIDDLE.ATLANTIC", "MIDDLE.ATLANTIC", "MIDDLE.ATLANTIC", "MIDDLE.ATLANTIC",
"MIDDLE.ATLANTIC", "MIDDLE.ATLANTIC", "MIDDLE.ATLANTIC", "E..NOR..CENTRAL", "E..NOR..CENTRAL", "E..NOR..CENTRAL",
"E..NOR..CENTRAL", "E..NOR..CENTRAL", "E..NOR..CENTRAL", "W..NOR..CENTRAL", "W..NOR..CENTRAL", "W..NOR..CENTRAL",
"W..NOR..CENTRAL", "W..NOR..CENTRAL", "SOUTH.ATLANTIC", "SOUTH.ATLANTIC", "SOUTH.ATLANTIC", "SOUTH.ATLANTIC",
"E..SOU..CENTRAL", "E..SOU..CENTRAL", "E..SOU..CENTRAL", "W..SOU..CENTRAL", "W..SOU..CENTRAL", "MOUNTAIN")
levels <- c("NEW ENGLAND", "MIDDLE ATLANTIC", "E. NOR. CENTRAL", "W. NOR. CENTRAL", "SOUTH ATLANTIC", "E. SOU. CENTRAL",
"W. SOU. CENTRAL", "MOUNTAIN", "PACIFIC")
#replace every literal period with a space (fixed = TRUE avoids regex escaping)
chr_string <- gsub(".", " ", chr_string, fixed = TRUE)
#a double period is now a double space; turn it into '. ' as required by levels
chr_string <- gsub("  ", ". ", chr_string, fixed = TRUE)
#gsub is vectorized and returns an unnamed vector, so no sapply or names() cleanup is needed
#as requested; expected result
factor(chr_string, levels = levels)
Alternatively, just change levels so that it matches the strings in chr_string.
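A minimal sketch of that alternative, assuming the only difference between the two spellings is that every space in levels appears as a period in the data. The sample vector is shortened from the question, and dotted_levels is a hypothetical name introduced here:

```r
# Shortened sample of the data from the question
chr_string <- c("NEW.ENGLAND", "MIDDLE.ATLANTIC", "E..NOR..CENTRAL",
                "W..SOU..CENTRAL", "MOUNTAIN")
levels <- c("NEW ENGLAND", "MIDDLE ATLANTIC", "E. NOR. CENTRAL", "W. NOR. CENTRAL",
            "SOUTH ATLANTIC", "E. SOU. CENTRAL", "W. SOU. CENTRAL", "MOUNTAIN", "PACIFIC")

# dotted_levels (hypothetical name): the levels rewritten in the data's spelling,
# e.g. "E. NOR. CENTRAL" becomes "E..NOR..CENTRAL"
dotted_levels <- gsub(" ", ".", levels, fixed = TRUE)

# The strings now match exactly, so no NAs are produced
f <- factor(chr_string, levels = dotted_levels)
print(f)
```

The trade-off is that the resulting factor shows the dotted spelling in its levels; the data itself is left untouched.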
Answer 1 (score: 0)
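The levels you specify need to match the strings. Your first element is "NEW.ENGLAND", but in the levels you have "NEW ENGLAND" (with a space instead of a dot), so R won't match them. You need exact matches when creating the factor; you can then use the labels argument to change the level labels after matching, or you can use a second step and call levels to change the labels.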
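Both options might look like the sketch below. It assumes the data spelling can be derived from the desired labels by replacing spaces with periods; the sample vector is shortened from the question, and dotted_levels is a hypothetical name:

```r
# Shortened sample of the data from the question
chr_string <- c("NEW.ENGLAND", "MIDDLE.ATLANTIC", "E..NOR..CENTRAL",
                "W..SOU..CENTRAL", "MOUNTAIN")
levels <- c("NEW ENGLAND", "MIDDLE ATLANTIC", "E. NOR. CENTRAL", "W. NOR. CENTRAL",
            "SOUTH ATLANTIC", "E. SOU. CENTRAL", "W. SOU. CENTRAL", "MOUNTAIN", "PACIFIC")

# dotted_levels (hypothetical name): the levels as they are spelled in the data
dotted_levels <- gsub(" ", ".", levels, fixed = TRUE)

# Option 1: match on the dotted spelling, relabel in the same call via labels
f1 <- factor(chr_string, levels = dotted_levels, labels = levels)

# Option 2: create the factor first, then replace its levels in a second step
f2 <- factor(chr_string, levels = dotted_levels)
levels(f2) <- levels

print(f1)
print(f2)
```

Either way the matching is done against the dotted spelling, and only the displayed labels change, so chr_string itself does not need to be modified.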