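I have a survey dataset that was imported from a SAS file, but it does not include the text labels associated with the numeric codes in the data.
I am trying to apply the factor function to every variable and then set the levels and labels for each variable individually.
I have a main data frame that holds the actual data, and a second data frame that holds the text labels corresponding to each value of each variable.
So, for example, the variable columns in the main dataset are named A1, B1, C1, D1. The second data frame with the labels is listed below, filled with dummy text. The number of values that need a text label differs from variable to variable.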
labels_list <- structure(list(VariableName = c("A1", "A1", "A1", "B1", "B1",
"B1", "B1", "C1", "C1", "C1", "C1", "C1", "D1", "D1", "D1", "D1",
"D1", "D1"), Value = c(1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 5L, 1L, 2L, 3L, 4L, 5L, 6L), Label = c("Red", "Blue", "Yellow",
"Up", "Down", "Left", "Right", "Boston", "Atlanta", "Dallas",
"New York", "Los Angeles", "John", "Jim", "Jake", "Bill", "Bob",
"Brian")), class = "data.frame", row.names = c(NA, -18L))
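I am trying to write a function that labels all of the factor variables automatically. The function trims both data frames so that they contain exactly the same variables, and then puts them in exactly the same order. I used the split function to break the table above into a list, so that each variable name has its own list element, but I get an error when I try to subset that list inside a for loop.
Below is the for loop I wrote.
df = the main dataset
labels_list = the list with the values and text labels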
for (i in 1:ncol(df)) {
  for (j in labels_list) {
    if (names(x[, i]) == names(ahs_split[[j]])) {
      x[, i] <- factor(x[, i], levels = c(ahs_split[[j]][[2]]), labels = c(ahs_split[[j]][[3]]))
    }
  }
}
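As I mentioned, my end goal is to take the data frame of text labels and corresponding values and apply it to each variable separately with the factor function. I have been at this for almost a month and am thoroughly stuck, so any help is welcome. If anyone can recommend a better approach or point me in the right direction, I would appreciate it.
Answer 0 (score: 1)
One approach is to convert your labels_list into a list of lists: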
library(dplyr) # just using dplyr for the pipe %>%, otherwise everything is in base R
# Convert df to list of key:value pairs
labels_list <- labels_list %>%
split(f = labels_list$VariableName) %>%
lapply(function(x) list(key = x$Value, value = x$Label))
For example, the entry for A1 looks like this:
$A1
$A1$key
[1] 1 2 3
$A1$value
[1] "Red" "Blue" "Yellow"
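This can be mapped onto your df using apply. It is a little hacky, because I put the column name in as the first item of the vector passed to the function.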
# Map labels onto sample data with factor()
apply(rbind(names(df), df),
2,
function(x) factor(x[2:length(x)],
levels = labels_list[[x[1]]]$key,
labels = labels_list[[x[1]]]$value)) %>%
as.data.frame()
A1 B1 C1 D1
1 Blue Up Dallas Jake
2 Red Down New York Jake
3 Yellow Left Boston Jim
4 Yellow Right Boston John
5 Yellow Down Los Angeles Jake
6 Red Left Atlanta Jake
7 Blue Down New York John
8 Red Down Atlanta Brian
9 Blue Up New York Jim
10 Yellow Down Atlanta Bill
Sample data used above:
set.seed(1724)
df <- data.frame(A1 = floor(runif(10, 1, 4)),
B1 = floor(runif(10, 1, 5)),
C1 = floor(runif(10, 1, 6)),
D1 = floor(runif(10, 1, 7)))
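One caveat with the apply call above: apply simplifies its result to a character matrix, so the columns that come back hold the label text but are not factors with the levels in lookup order. The following is a minimal base R sketch of the same idea that keeps each column as a factor. It assumes the lookup table has the question's VariableName, Value and Label columns; `labels_df`, `survey`, `lookup`, `shared` and the unlabelled column `X9` are hypothetical names made up for the illustration.

```r
# Lookup table in the same shape as the question's labels_list (shortened here).
labels_df <- data.frame(
  VariableName = c("A1", "A1", "A1", "B1", "B1", "B1", "B1"),
  Value = c(1L, 2L, 3L, 1L, 2L, 3L, 4L),
  Label = c("Red", "Blue", "Yellow", "Up", "Down", "Left", "Right"),
  stringsAsFactors = FALSE
)

# Hypothetical main data; X9 has no entry in the lookup table.
survey <- data.frame(A1 = c(2, 1, 3), B1 = c(4, 4, 1), X9 = c(10, 20, 30))

# One lookup data frame per variable name.
lookup <- split(labels_df, labels_df$VariableName)

# Only convert the columns that actually have labels.
shared <- intersect(names(survey), names(lookup))
survey[shared] <- lapply(shared, function(v) {
  factor(survey[[v]], levels = lookup[[v]]$Value, labels = lookup[[v]]$Label)
})

str(survey)
```

Looping over column names rather than column positions avoids having to sort both objects into the same order first, and columns without a lookup entry are left as they were.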
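Answer 1 (score: 1)
If you don't mind a few tidyverse verbs, you can reshape the data with tidyr::gather. Once the data is in long form, you can join it to the code lookup by variable name and value, and then reshape it back to wide format. This workflow scales no matter how many columns you have.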
library(dplyr)
library(tidyr)
labels_list <- structure(list(Variable = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("A1",
"B1", "C1", "D1"), class = "factor"), Value = c(1L, 2L, 3L, 1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 6L), Label = structure(c(15L,
3L, 18L, 17L, 8L, 12L, 16L, 5L, 1L, 7L, 14L, 13L, 11L, 10L, 9L,
2L, 4L, 6L), .Label = c("Atlanta", "Bill", "Blue", "Bob", "Boston",
"Brian", "Dallas", "Down", "Jake", "Jim", "John", "Left", "Los_Angeles",
"New_York", "Red", "Right", "Up", "Yellow"), class = "factor")), class = "data.frame", row.names = c(NA,
-18L))
df <- tibble(A1 = rep(1:3,2),
B1 = c(1:4, 1, 2),
C1 = c(1:5, 1),
D1 = 1:6
)
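Numbering the rows within each Variable is needed for spread to work, because it identifies which values belong on the same output row, but you can drop that column once you no longer need it.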
df %>%
gather(key = Variable, value = Value) %>%
left_join(labels_list, by = c("Variable", "Value")) %>%
select(-Value) %>%
group_by(Variable) %>%
mutate(row = row_number()) %>%
spread(key = Variable, value = Label)
#> Warning: Column `Variable` joining character vector and factor, coercing
#> into character vector
#> # A tibble: 6 x 5
#> row A1 B1 C1 D1
#> <int> <fct> <fct> <fct> <fct>
#> 1 1 Red Up Boston John
#> 2 2 Blue Down Atlanta Jim
#> 3 3 Yellow Left Dallas Jake
#> 4 4 Red Right New_York Bill
#> 5 5 Blue Up Los_Angeles Bob
#> 6 6 Yellow Down Boston Brian
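In current tidyr releases, gather and spread are superseded by pivot_longer and pivot_wider. A rough sketch of the same long-join-wide workflow with the newer verbs might look like the following. It assumes tidyr 1.0.0 or later and uses a shortened, hypothetical lookup table and data frame (`lookup_tbl`, `survey`, `labelled`) rather than the objects above.

```r
library(dplyr)
library(tidyr)

# Shortened, hypothetical lookup table with character columns.
lookup_tbl <- data.frame(
  Variable = c("A1", "A1", "A1", "B1", "B1", "B1", "B1"),
  Value = c(1L, 2L, 3L, 1L, 2L, 3L, 4L),
  Label = c("Red", "Blue", "Yellow", "Up", "Down", "Left", "Right"),
  stringsAsFactors = FALSE
)

# Hypothetical main data holding the numeric codes.
survey <- tibble(A1 = c(1L, 2L, 3L, 1L), B1 = c(1L, 2L, 3L, 4L))

labelled <- survey %>%
  mutate(row = row_number()) %>%            # remember which row each value came from
  pivot_longer(-row, names_to = "Variable", values_to = "Value") %>%
  left_join(lookup_tbl, by = c("Variable", "Value")) %>%
  select(-Value) %>%
  pivot_wider(names_from = Variable, values_from = Label) %>%
  select(-row)                              # the row id is no longer needed

labelled
```

Adding the row id before reshaping, instead of after the join, means the id is carried through the long format directly. The resulting columns are character here; wrap them in factor() if factor columns are required.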