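I regularly work with survey data that arrives in badly formatted Excel files, laid out for human readability rather than for any kind of data analysis. I am looking for a way to clean this data in R and turn it into a data frame of variables and observations.
I know there are plenty of tutorials on data cleaning in R, but in my experience they mostly deal with data that is already in a machine-readable format, so any help with this would be appreciated!
Here is a dummy example of a raw survey in this shape: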
Are you male or female?
Variable1 Variable2 Variable3 Variable4
Male n% n% n% n%
Female n% n% n% n%
How old are you?
Variable1 Variable2 Variable3 Variable4
18-34 n% n% n% n%
35+ n% n% n% n%
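And so on, where the white space represents empty cells/rows: the full text of each survey question sits in column A a few rows above its data table, and all of the questions and data tables are on a single worksheet.
Is there a way to convert it into this with R code?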
Question Response Variable1 Variable2 Variable3 Variable4
Are you male or female? Male n% n% n% n%
Are you male or female? Female n% n% n% n%
How old are you? 18-34 n% n% n% n%
How old are you? 35+ n% n% n% n%
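At the moment I do this in Excel with some VBA code and then read the result into R for further analysis/visualisation, but it would be great to skip the Excel stage and go straight to R.
Thanks!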
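Answer 0 (score: 1)
Here is a rough way to deal with badly arranged data. I wrote an example in CSV format and hosted it in a miscellaneous repo:
library(dplyr)  # provides %>% and mutate(), used in steps (3) and (6)
file <- "https://raw.githubusercontent.com/minerva79/woodpecker/master/data/example.csv"
survey <- readLines(file)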
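If the data is still in an Excel workbook rather than a CSV, one possible bridge into the steps that follow is to read the sheet without headers and collapse each row into a comma-separated line. The sketch below assumes the sheet was read with something like readxl::read_excel(path, col_names = FALSE); rows_to_lines and the small raw data frame are hypothetical names and stand-in data for illustration, and cell values are assumed not to contain commas themselves.

```r
# Hypothetical helper: turn a header-less data frame (one row per sheet row)
# into comma-separated lines, writing empty cells as empty strings.
rows_to_lines <- function(raw) {
  cols <- lapply(raw, function(col) {
    col <- as.character(col)
    col[is.na(col)] <- ""   # empty Excel cells usually arrive as NA
    col
  })
  # Paste the columns together element-wise, one string per sheet row.
  do.call(paste, c(unname(cols), sep = ","))
}

# Stand-in for: raw <- readxl::read_excel(path, col_names = FALSE)
raw <- data.frame(
  c1 = c("Are you male or female?", NA, NA, "Male"),
  c2 = c(NA, NA, "Variable1", "0.5"),
  c3 = c(NA, NA, "Variable2", "0.6"),
  stringsAsFactors = FALSE
)

sheet_lines <- rows_to_lines(raw)
print(sheet_lines)
```

The resulting character vector has the same shape as the output of readLines() on the CSV, so it could be used as survey in the steps below.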
(1) Strip all blank lines:
white.lines <- nchar(gsub(",", "", survey))==0
survey <- survey[!white.lines]
[1] "Are you male or female?,,,," ",Variable1,Variable2,Variable3,Variable4" "Male,0.5,0.6,0.7,0.8"
[4] "Female,0.5,0.4,0.3,0.2" "How old are you?,,,," ",Variable1,Variable2,Variable3,Variable4"
[7] "18-34,0.4,0.5,0.7,0.1" "35+,0.6,0.5,0.3,0.9"
(2) Identify the header positions:
headers <- substring(survey, 1,1) == ","
survey[headers]
[1] ",Variable1,Variable2,Variable3,Variable4" ",Variable1,Variable2,Variable3,Variable4"
(3) Find the question positions from the header positions:
header_pos <- which(headers)
qn_pos <- header_pos - 1
qn <- survey[qn_pos] %>% gsub(",", "", .)
qn
[1] "Are you male or female?" "How old are you?"
(4) Determine the rows of each table (from header_pos to the line before the next question, or to length(survey) for the last table):
end_pos <- c(qn_pos[-1] - 1, length(survey))
tabs <- lapply(seq_along(qn), function(x) survey[header_pos[x]:end_pos[x]])
tabs
[[1]]
[1] ",Variable1,Variable2,Variable3,Variable4" "Male,0.5,0.6,0.7,0.8" "Female,0.5,0.4,0.3,0.2"
[[2]]
[1] ",Variable1,Variable2,Variable3,Variable4" "18-34,0.4,0.5,0.7,0.1" "35+,0.6,0.5,0.3,0.9"
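As an alternative to computing start and end indices by hand, the same grouping can be sketched with cumsum() and split(). This is only an illustration under the same layout assumption as above (every header line starts with a comma and is directly preceded by its question line); the names is_question, blocks, tabs_alt and qn_alt are hypothetical, and the inline survey vector repeats the cleaned lines shown in step (1).

```r
# Cleaned lines as shown after step (1).
survey <- c(
  "Are you male or female?,,,,",
  ",Variable1,Variable2,Variable3,Variable4",
  "Male,0.5,0.6,0.7,0.8",
  "Female,0.5,0.4,0.3,0.2",
  "How old are you?,,,,",
  ",Variable1,Variable2,Variable3,Variable4",
  "18-34,0.4,0.5,0.7,0.1",
  "35+,0.6,0.5,0.3,0.9"
)

is_header <- startsWith(survey, ",")
# A question line is the line directly before a header line.
is_question <- c(is_header[-1], FALSE)
# Every question line starts a new block; cumsum() numbers the blocks.
blocks <- split(survey, cumsum(is_question))

# First line of each block is the question (trailing commas removed),
# the remaining lines are the header plus the data rows.
qn_alt <- vapply(blocks, function(b) sub(",+$", "", b[1]), character(1))
tabs_alt <- lapply(blocks, function(b) b[-1])
```

Using sub(",+$", "", ...) removes only the trailing commas, so a comma inside the question text itself would be preserved.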
(5) Read each list element as a table:
tabs <- lapply(tabs, function(x) read.table(text = x, sep = ",", header = TRUE, row.names = 1))
tabs
[[1]]
Variable1 Variable2 Variable3 Variable4
Male 0.5 0.6 0.7 0.8
Female 0.5 0.4 0.3 0.2
[[2]]
Variable1 Variable2 Variable3 Variable4
18-34 0.4 0.5 0.7 0.1
35+ 0.6 0.5 0.3 0.9
(6) Add the Question and Response columns with mutate, then rbind:
tabs <- lapply(seq_along(tabs), function(x) tabs[[x]] %>% mutate(Question = qn[x], Response = row.names(.)))
do.call(rbind, tabs)
Variable1 Variable2 Variable3 Variable4 Question Response
1 0.5 0.6 0.7 0.8 Are you male or female? Male
2 0.5 0.4 0.3 0.2 Are you male or female? Female
3 0.4 0.5 0.7 0.1 How old are you? 18-34
4 0.6 0.5 0.3 0.9 How old are you? 35+
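Steps (1) to (6) can be folded into a single function. The following is a minimal base R sketch, not a definitive implementation: tidy_survey_lines is a hypothetical name, the inline example_lines mimic the example CSV (including blank ",,,," rows), and it assumes every header line starts with a comma and is directly preceded by its question once blank lines are removed. It also puts Question and Response first, matching the layout asked for in the question.

```r
tidy_survey_lines <- function(lines) {
  # (1) Drop lines that are empty once the commas are removed.
  lines <- lines[nzchar(gsub(",", "", lines, fixed = TRUE))]
  # (2)-(4) Locate headers, questions, and the last row of each table.
  header_pos <- which(startsWith(lines, ","))
  qn_pos <- header_pos - 1
  end_pos <- c(qn_pos[-1] - 1, length(lines))
  pieces <- lapply(seq_along(header_pos), function(i) {
    # (5) Parse one table; the empty first header cell becomes column "X".
    tab <- read.table(text = lines[header_pos[i]:end_pos[i]], sep = ",",
                      header = TRUE, quote = "\"", comment.char = "",
                      stringsAsFactors = FALSE)
    names(tab)[1] <- "Response"
    # (6) Prepend the question text, stripped of its trailing commas.
    data.frame(Question = sub(",+$", "", lines[qn_pos[i]]), tab,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pieces)
}

example_lines <- c(
  "Are you male or female?,,,,", ",,,,",
  ",Variable1,Variable2,Variable3,Variable4",
  "Male,0.5,0.6,0.7,0.8", "Female,0.5,0.4,0.3,0.2", ",,,,",
  "How old are you?,,,,", ",,,,",
  ",Variable1,Variable2,Variable3,Variable4",
  "18-34,0.4,0.5,0.7,0.1", "35+,0.6,0.5,0.3,0.9"
)

res <- tidy_survey_lines(example_lines)
print(res)
```

With the real file this would be called as tidy_survey_lines(readLines(file)). Tables with differing column names would not rbind cleanly, so that case would need extra handling.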
== Edit: because the question was initially unclear, I have moved my older answer below.
Suppose you have 2 survey questions as follows:
set.seed(4)
sq_1 <- data.frame(V1 = rnorm(2, .5, .1), V2 = rnorm(2, .5, .1),V3 = rnorm(2, .5, .1),V4 = rnorm(2, .5, .1), row.names=paste0("response",1:2))
sq_2 <- data.frame(V1 = rnorm(2, .5, .1), V2 = rnorm(2, .5, .1),V3 = rnorm(2, .5, .1),V4 = rnorm(2, .5, .1), row.names=paste0("response",1:2))
write.csv(sq_1, "survey_question_1.csv")
write.csv(sq_2, "survey_question_2.csv")
Read them into R as a list:
files <- list.files(pattern = "^survey_question_.*\\.csv$")  # only the files written above
survey <- lapply(files, read.csv, header = TRUE, row.names = 1)
Use dplyr to insert the Question and Response columns:
library(dplyr)
survey <- lapply(1:length(survey), function(x) survey[[x]] %>%
mutate(Question=paste0("Q",x), Response = rownames(.)))
do.call(rbind, survey)
V1 V2 V3 V4 Question Response
1 0.5216755 0.5891145 0.6635618 0.3718753 Q1 response1
2 0.4457507 0.5595981 0.5689275 0.4786855 Q1 response2
3 0.6896540 0.5566604 0.5383057 0.5034352 Q2 response1
4 0.6776863 0.5015719 0.4954863 0.5169027 Q2 response2
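The same combination can also be sketched in base R by naming the list elements and using those names as question labels. This is an illustration only: survey_list and its values are made-up stand-ins for the tables read from the CSV files, and the names Q1 and Q2 are assumptions (with real files they could be derived from the file names, for example with tools::file_path_sans_ext(basename(files))).

```r
# Made-up stand-in for the list of tables read from the CSV files.
survey_list <- list(
  Q1 = data.frame(V1 = c(0.5, 0.4), V2 = c(0.6, 0.5),
                  row.names = c("response1", "response2")),
  Q2 = data.frame(V1 = c(0.7, 0.6), V2 = c(0.5, 0.5),
                  row.names = c("response1", "response2"))
)

# Prepend Question (the list name) and Response (the row names) to each table.
pieces <- Map(function(tab, q) {
  data.frame(Question = q, Response = rownames(tab), tab,
             row.names = NULL, stringsAsFactors = FALSE)
}, survey_list, names(survey_list))

combined <- do.call(rbind, pieces)
rownames(combined) <- NULL  # replace "Q1.1"-style row names with 1..n
print(combined)
```

Keeping the question label in the list names means the label travels with each table, rather than depending on the position of the table in the list.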