Question

I have a CSV file with four fields:

ID, choice1, choice2, choice3

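The three choice fields are strings representing a categorical variable, and the domain is the same in each case.

I want to read this into R and have these columns be factors rather than strings, but with consistent level values across the columns, so that 'foo' in the choice1 column has the same value as 'foo' in the choice2 column, and so on.

How can I ensure that all of the choice columns share the same categorical type?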

Answer 1

It looks like you need to read the columns as character and convert them to factors yourself:

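# Read the choice columns as plain character vectors first.
data = read.csv(file, stringsAsFactors = FALSE)
# Collect every value that occurs in any of the choice columns.
levels = with(data, unique(c(choice1, choice2, choice3)))
# Convert each column using the same level set.
data = within(data, {
    choice1 = factor(choice1, levels)
    choice2 = factor(choice2, levels)
    choice3 = factor(choice3, levels)
})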

read.csv has a colClasses argument, but it only accepts class names given as character strings, so unfortunately it is of no use here.
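
To illustrate the point about colClasses, here is a small sketch in which each choice column is read as a factor directly. The inline sample data and the name `per_col` are assumptions made for the example, not part of the question. Each column ends up with only the levels it happens to contain, so the same string can receive different integer codes in different columns.

```r
# Hypothetical sample data, passed through the `text` argument instead of a file.
csv <- "ID,choice1,choice2,choice3
1,A,B,E
2,A,C,D
3,B,B,A
4,B,D,E"

# Ask read.csv to build a factor for every choice column.
per_col <- read.csv(text = csv,
                    colClasses = c("integer", "factor", "factor", "factor"))

# Each column only knows about the values that occur in it.
print(levels(per_col$choice1))
print(levels(per_col$choice2))

# Row 3 of choice1 and row 1 of choice2 both hold "B",
# yet the underlying integer codes differ.
print(as.integer(per_col$choice1[3]))
print(as.integer(per_col$choice2[1]))
```

This is why the conversion above passes one shared level vector to every call to factor.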

Answer 2

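Here is a possible solution that lets you compute the "common factor" column indexes up front and avoids the subsequent repetition (using my own random data):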

system('cat data.csv');
## ID,choice1,choice2,choice3
## 1,A,B,E
## 2,A,C,D
## 3,B,B,A
## 4,B,D,E
raw <- read.csv('data.csv',stringsAsFactors=FALSE);
fcols <- grep('^choice\\d$',names(raw));
levels <- unique(do.call(c,raw[,fcols,drop=FALSE]));
dat <- data.frame(c(raw[,-fcols,drop=FALSE],lapply(raw[,fcols,drop=FALSE],factor,levels)));
dat;
##   ID choice1 choice2 choice3
## 1  1       A       B       E
## 2  2       A       C       D
## 3  3       B       B       A
## 4  4       B       D       E
lapply(dat,levels);
## $ID
## NULL
## 
## $choice1
## [1] "A" "B" "C" "D" "E"
## 
## $choice2
## [1] "A" "B" "C" "D" "E"
## 
## $choice3
## [1] "A" "B" "C" "D" "E"
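
As a quick check of what the question asks for, the following sketch confirms that, once the choice columns share one level vector, equal strings map to equal integer codes in every column. The inline data and the names `choice_cols`, `shared`, and `codes` are assumptions for illustration, and the levels are sorted here purely for readability.

```r
# Hypothetical sample data, read from a string rather than a file.
csv <- "ID,choice1,choice2,choice3
1,A,B,E
2,A,C,D
3,B,B,A
4,B,D,E"

raw <- read.csv(text = csv, stringsAsFactors = FALSE)

# Locate the choice columns and gather every value they contain.
choice_cols <- grep("^choice\\d$", names(raw))
shared <- sort(unique(unlist(raw[choice_cols], use.names = FALSE)))

# Convert the choice columns in place, all with the same levels.
raw[choice_cols] <- lapply(raw[choice_cols], factor, levels = shared)

# Matrix of integer codes, one column per choice column.
codes <- sapply(raw[choice_cols], as.integer)
print(codes)

# "B" in choice1 (row 3) and "B" in choice2 (row 1) now share a code.
print(codes[3, "choice1"] == codes[1, "choice2"])
```

Assigning back into `raw[choice_cols]` keeps the ID column and the column order untouched, which is one alternative to rebuilding the data frame.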

Consistent factor values across multiple columns
