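I'm new to R and have already learned a lot by browsing other questions on this great site!
But now I'm dealing with a data management problem that I can't solve from other examples, so I'm hoping you can help.
I have a set of survey responses that I read in from a CSV file, and they ended up in the format shown in the example below: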
test <- c(
"[1234],Bob Smith,",
"Q-0,Male",
"Q-1,18-25",
"Q-2,Computer Science",
",",
"[5678],Julie Lewis",
"Q-0,Female",
"Q-1,18-25",
",",
","
)
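Note that "," appears on lines by itself because I used fill=TRUE in read.csv to cope with rows that do not all have the same length. Also note that not all respondents answered every question.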
I need to convert this into a data frame with the following structure:
      ID        name gender   age            major
1 [1234]   Bob Smith   Male 18-25 Computer Science
2 [5678] Julie Lewis Female 18-25               NA
...
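It seems I can't read the data into a matrix or data frame row by row, because not all respondents answered every question. Any suggestions on how to handle this?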
Answer 0: (score: 2)
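First, you could probably save yourself a lot of trouble by reading the CSV file in the right format to begin with. read.csv is a powerful function that should be able to handle your data, so this kind of workaround shouldn't be necessary.
That said, here goes: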
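# Assumes exactly five elements per respondent, so each respondent becomes one row
x <- matrix(test, byrow=TRUE, ncol=5)
# Strip the "Q-n," prefix from each answer
x <- sub("Q-\\w+,", "", x)
# A bare "," is a filler cell, i.e. no answer
x[x==","] <- NA
# Split the first column into ID and name, then re-attach the answer columns
x <- cbind(matrix(unlist(strsplit(x[, 1], ",")), byrow=TRUE, ncol=2), x[, -1])
x <- as.data.frame(x, stringsAsFactors=FALSE)
names(x) <- c("ID", "Name", "Gender", "Age", "Major", "V1")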
This results in:
x
      ID        Name Gender   Age            Major   V1
1 [1234]   Bob Smith   Male 18-25 Computer Science <NA>
2 [5678] Julie Lewis Female 18-25             <NA> <NA>
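The matrix() call above only works when every respondent occupies exactly five elements of test. A minimal sketch of a check for that, assuming an ID line always starts with digits in square brackets (the names is_id, block and block_sizes are illustrative and not part of the answer):

```r
# Sample data copied from the question
test <- c(
  "[1234],Bob Smith,", "Q-0,Male", "Q-1,18-25", "Q-2,Computer Science", ",",
  "[5678],Julie Lewis", "Q-0,Female", "Q-1,18-25", ",", ","
)

# Assumption: a new respondent starts at every line beginning with "[digits]"
is_id <- grepl("^\\[[0-9]+\\]", test)

# Running count of ID lines gives a respondent number for every element
block <- cumsum(is_id)

# Number of elements per respondent
block_sizes <- as.vector(table(block))

# TRUE only if matrix(test, byrow=TRUE, ncol=5) lines up with respondents
ok_for_matrix <- all(block_sizes == 5)
ok_for_matrix
```

If ok_for_matrix is FALSE, the rows of the matrix would no longer correspond to respondents, and a grouping-based approach is likely the safer choice.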
Answer 1: (score: 0)
This is a bit clunky, but it works.
Here is the data:
test <- c(
"[1234],Bob Smith,",
"Q-0,Male",
"Q-1,18-25",
"Q-2,Computer Science",
",",
"[5678],Julie Lewis",
"Q-0,Female",
"Q-1,18-25",
",",
"[1234],Bob Smith,",
"Q-1,18-25",
"Q-2,Computer Science",
","
)
Here is the manipulation code:
#remove rows with just a comma
test <- test[test!=","]
#find id cases and remove the commas between the id and the name
#and add an id label
idcases <- grep("\\[.*\\]",test)
test[idcases] <- paste("id,",gsub(",","",test[idcases]),sep="")
#find id values positions and end position
idvals <- c(idcases,length(test)+1)
#generate a sequence identifier for each respondent
setid <- rep(1:(length(idvals)-1),diff(idvals))
#put the set id against each value
result1 <- paste(setid,test,sep=",")
#split the strings up and make them a data.frame
result2 <- data.frame(do.call(rbind,strsplit(result1,",")))
#get the final dataset with a reshape
final <- reshape(result2,idvar="X1",timevar="X2",direction="wide")[,-1]
#clean up the names etc
names(final) <- c("name","gender","age","major")
final$id <- gsub("(\\[.*\\])(.*)","\\1",final$name)
final$name <- gsub("(\\[.*\\])(.*)","\\2",final$name)
Which gives:
> final
         name gender   age            major     id
1   Bob Smith   Male 18-25 Computer Science [1234]
5 Julie Lewis Female 18-25             <NA> [5678]
8   Bob Smith   <NA> 18-25 Computer Science [1234]
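As a possible variation on the same idea, the reshape step can be swapped for an explicit lookup from question code to column name, which keeps ID and name as separate fields from the start. This is only a sketch under assumptions: the questions mapping and the parse_block helper are hypothetical names introduced here, and ID lines are assumed to start with digits in square brackets.

```r
# Same sample data as in this answer
test <- c(
  "[1234],Bob Smith,", "Q-0,Male", "Q-1,18-25", "Q-2,Computer Science", ",",
  "[5678],Julie Lewis", "Q-0,Female", "Q-1,18-25", ",",
  "[1234],Bob Smith,", "Q-1,18-25", "Q-2,Computer Science", ","
)

# Drop filler rows, then number the respondents by counting ID lines
lines <- test[test != ","]
block <- cumsum(grepl("^\\[[0-9]+\\]", lines))

# Hypothetical mapping from question code to output column name
questions <- c("Q-0" = "gender", "Q-1" = "age", "Q-2" = "major")

# Turn one respondent's lines into a one-row data frame
parse_block <- function(b) {
  head_parts <- strsplit(b[1], ",")[[1]]   # ID and name
  answers <- strsplit(b[-1], ",")          # list of c(code, value)
  keys <- vapply(answers, `[`, character(1), 1)
  vals <- vapply(answers, `[`, character(1), 2)
  # Start with NA for every question, then fill in the ones answered
  out <- setNames(rep(NA_character_, length(questions)), questions)
  out[questions[keys]] <- vals
  data.frame(ID = head_parts[1], name = head_parts[2], as.list(out),
             stringsAsFactors = FALSE)
}

final <- do.call(rbind, lapply(split(lines, block), parse_block))
rownames(final) <- NULL
final
```

Unanswered questions stay NA because each row is pre-filled before the answers are written in, so respondents with different numbers of answers still produce rows of equal width.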