Recoding messy text data as numbers in R

Date: 2014-12-13 20:26:44

Tags: r string numeric summary

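I am trying to analyze a large, messy, poorly coded data file about library reference interactions. Here is a small sample of the data that captures what I am trying to do: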

# assemble data
record<-c(2883823,2883824,2883825,2883826,2883828,2884074,2884076,2884660,2885106,2885222,2885703,2885709)
desk<-c("RRSS","RRSS","RRSS","RRSS","RRSS","RRSS","RRSS","Virt","RRSS","Virt","Virt","RRSS")
inperson<-c("InPerson<5Minutes",NA,NA,"InPerson<5Minutes",NA,NA,"InPerson<5Minutes",NA,"InPerson5-15Minutes",NA,NA,"InPerson15-30minutes")
phone<-c(NA,"Phone5-15Minutes","Phone<5Minutes",NA,NA,"Phone<5Minutes",NA,NA,NA,NA,NA,NA)
chat<-c(NA,NA,NA,NA,"Chat<5Minutes",NA,NA,"Chat5-15Minutes",NA,"Chat5-15Minutes","Chat15-30minutes",NA)

reference<-data.frame(record,desk,inperson,phone,chat) #create data frame

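I would like to recode the different levels of the inperson, phone, and chat variables from strings to numbers (possibly under new names for clarity; I use the prefix Num below to indicate this). I assume this will be some kind of if-then statement (though since the wording in the input data differs from variable to variable, the rule would be different for each one):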

record  desk    Numperson   Numphone    Numchat  
2883823 RRSS    1           0           0
2883824 RRSS    0           2           0
2883825 RRSS    0           1           0
2883826 RRSS    1           0           0
2883828 RRSS    0           0           1
2884074 RRSS    0           1           0
2884076 RRSS    1           0           0
2884660 Virt    0           0           2
2885106 RRSS    2           0           0
2885222 Virt    0           0           2
2885703 Virt    0           0           3
2885709 RRSS    3           0           0

and then rearrange it into a shape better suited for analysis, like this:

record  desk    type    Numlevel  
2883823 RRSS    person  1  
2883824 RRSS    phone   2  
2883825 RRSS    phone   1  
2883826 RRSS    person  1  
2883828 RRSS    chat    1  
2884074 RRSS    phone   1  
2884076 RRSS    person  1  
2884660 Virt    chat    2  
2885106 RRSS    person  2  
2885222 Virt    chat    2  
2885703 Virt    chat    3  
2885709 RRSS    person  3  

As a beginner, I would greatly appreciate any help, or pointers to where I should look.

1 Answer:

Answer 0: (score: 3)

Maybe something like this:

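#clean up: strip the channel prefix and the "Minutes"/"minutes" suffix,
#leaving only the duration label ("<5", "5-15", "15-30"); NA stays NA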
reference$inperson <- gsub("InPerson|[Mm]inutes", "", reference$inperson)
reference$phone <- gsub("Phone|[Mm]inutes", "", reference$phone)
reference$chat <- gsub("Chat|[Mm]inutes", "", reference$chat)

#reshape to long format
library(reshape2)
reference <- melt(reference, id.vars = c("record", "desk"), 
                  variable.name = "type", value.name = "Numlevel",
                  na.rm = TRUE)

#match: replace each duration label with its position in the ordered vector (1, 2, 3)
reference$Numlevel <- match(reference$Numlevel, c("<5", "5-15", "15-30"))

#    record desk     type Numlevel
#1  2883823 RRSS inperson        1
#4  2883826 RRSS inperson        1
#7  2884076 RRSS inperson        1
#9  2885106 RRSS inperson        2
#12 2885709 RRSS inperson        3
#14 2883824 RRSS    phone        2
#15 2883825 RRSS    phone        1
#18 2884074 RRSS    phone        1
#29 2883828 RRSS     chat        1
#32 2884660 Virt     chat        2
#34 2885222 Virt     chat        2
#35 2885703 Virt     chat        3
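
The answer above goes straight to the long format. If the intermediate wide table from the question (Numperson, Numphone, Numchat, with 0 where a channel was not used) is also wanted, the same gsub/match idea can be applied column by column in base R. The following is only a sketch under some assumptions: `recode_duration`, `raw_reference`, and `wide` are hypothetical names introduced here, the data frame is rebuilt from the question's sample so the block runs on its own, and any value that does not match one of the three known labels is treated as 0.

```r
# Rebuild the sample data from the question (raw, uncleaned strings).
raw_reference <- data.frame(
  record   = c(2883823, 2883824, 2883825, 2883826, 2883828, 2884074,
               2884076, 2884660, 2885106, 2885222, 2885703, 2885709),
  desk     = c("RRSS", "RRSS", "RRSS", "RRSS", "RRSS", "RRSS",
               "RRSS", "Virt", "RRSS", "Virt", "Virt", "RRSS"),
  inperson = c("InPerson<5Minutes", NA, NA, "InPerson<5Minutes", NA, NA,
               "InPerson<5Minutes", NA, "InPerson5-15Minutes", NA, NA,
               "InPerson15-30minutes"),
  phone    = c(NA, "Phone5-15Minutes", "Phone<5Minutes", NA, NA,
               "Phone<5Minutes", NA, NA, NA, NA, NA, NA),
  chat     = c(NA, NA, NA, NA, "Chat<5Minutes", NA, NA, "Chat5-15Minutes",
               NA, "Chat5-15Minutes", "Chat15-30minutes", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of the original answer): strip the channel
# prefix and the "Minutes"/"minutes" suffix, then look the remaining label
# up in an ordered vector of durations.
recode_duration <- function(x, prefix) {
  cleaned <- gsub(paste0(prefix, "|[Mm]inutes"), "", x)
  level <- match(cleaned, c("<5", "5-15", "15-30"))
  # Assumption: NA (channel not used) and any unrecognised label become 0.
  level[is.na(level)] <- 0L
  level
}

# Wide table with one numeric column per channel.
wide <- data.frame(
  record    = raw_reference$record,
  desk      = raw_reference$desk,
  Numperson = recode_duration(raw_reference$inperson, "InPerson"),
  Numphone  = recode_duration(raw_reference$phone, "Phone"),
  Numchat   = recode_duration(raw_reference$chat, "Chat")
)
print(wide)
```

Because `match` returns the position of the label in the lookup vector, the order of `c("<5", "5-15", "15-30")` is what defines the codes 1, 2, 3. In a larger file it may be worth checking `unique()` on each cleaned column first, since a misspelled label would silently end up as 0 here.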