Question

I am splitting this question into two parts: the first is general and the second is specific.

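First: I would like to know whether it is possible to label a numeric factor while still keeping its original numeric levels. This is particularly confusing to me because I realized that when we pass a labels argument to factor(), the labels become the factor's levels, for example: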

x <- factor(c(1, 2, 3, 2, 3, 1, 2), levels = c(1, 2, 3), labels = c("a", "b", "c"))
levels(x)
#[1] "a" "b" "c"
labels(x)
#[1] "1" "2" "3" "4" "5" "6" "7"

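I would like to know whether there is a way, as there is in Stata, to label the categories of a factor. I want to be able to sum x while its elements are displayed as "a", "b", or "c" but keep the values 1, 2, or 3.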
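To make the goal concrete, here is a minimal sketch of one way the original numbers could be recovered from the labeled factor. It assumes the vectors passed as levels and labels are still available; the names vals, lvs, lbs, and x_num are only illustrative. It relies on the fact that as.integer() on a factor returns the position of each element in levels(x), which can then index the original level values.

```r
# Illustrative values: the original codes and their labels
vals <- c(1, 2, 3, 2, 3, 1, 2)
lvs  <- c(1, 2, 3)
lbs  <- c("a", "b", "c")

x <- factor(vals, levels = lvs, labels = lbs)

# as.integer(x) gives each element's position in levels(x);
# indexing lvs with those positions maps back to the original numbers
x_num <- lvs[as.integer(x)]

sum(x_num)
```

Because the lookup goes through lvs rather than through the integer codes themselves, the same idea should also work when the original values are not 1, 2, 3, for example a 0/1 dummy variable whose codes would be 1 and 2.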

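Second: I ask because I have a very large data set containing columns with numeric categories. The data set comes with a dictionary in xlsx, which I have read into R and processed, so that every column has its numeric categories and their respective labels. I am trying to read the dictionary, create a list of categories and a list of labels per column, then read the data set, loop over the columns, and label the variables. The labels matter so that I do not have to look at the dictionary every time I need to interpret something in the data set. The numeric levels matter because I have many dummy variables (yes-or-no variables) and I want to be able to sum them.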

Here is my code (I use the data.table package):

library(data.table)

dic <- readRDS(dictionary_filename)

# Read the data set
data <- fread(dataset_filename, header = TRUE, sep = "|", encoding = "UTF-8", na.strings = c("NA", ""))

# Rows of the dictionary that describe categorized variables
# (this is very specific to my dictionary structure)
index <- which(!is.na(dic$num.categoria))

# Names of the columns that hold categorized variables
names_var <- dic$`Var name`[index]
names_var <- names_var[!is.na(names_var)]

# Data frame of the categorized variables, later split into a list
df <- as.data.frame(dic[index, ])

# Convert the index column to a factor so the data frame can be split
# into one sub-data-frame per categorized column
df$N <- as.factor(df$N)
lst <- split(df, df$N)

# Build one vector of labels and one vector of levels per column
lbs <- list()
lvs <- list()
for (i in seq_along(lst)) {
  lbs[[i]] <- as.vector(lst[[i]]$category)
  lvs[[i]] <- as.vector(lst[[i]]$category.number)
}

# Turn the data set columns into factors with their respective levels and labels
k <- 1
for (var in names_var) {
  set(data, j = var, value = factor(data[[var]], levels = lvs[[k]], labels = lbs[[k]]))
  k <- k + 1
}

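I realize the code is somewhat abstract because I have not provided the data set or the dictionary, but it is only meant to give you an idea. My code works: it runs without errors and does what I hoped it would (all categorical columns now display their labels, for example "Yes" or "No" where they were 1 or 0 before). The one problem is that I can no longer access the original numbers through the levels, and I need them in the next part of the project.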

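A general way of doing this would be best, because I run this code inside a function, over many columns, with different data sets and different dictionaries. Is there a way to achieve this?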
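As a rough sketch of what a general version might look like: the per-column level and label lists can be named by column and kept next to the data, so that any labeled column can be mapped back to its numbers on demand. Everything below is hypothetical: the toy data, the column names smoker and region, and the helper original_values() are invented for illustration and are not the real data set or dictionary.

```r
library(data.table)

# Hypothetical toy data and per-column dictionaries (not the real data set)
data <- data.table(smoker = c(1, 0, 1, 1), region = c(2, 1, 3, 2))
names_var <- c("smoker", "region")
lvs <- list(c(0, 1), c(1, 2, 3))
lbs <- list(c("No", "Yes"), c("North", "South", "East"))

# Name the lists by column so lookups do not depend on a running counter
names(lvs) <- names_var
names(lbs) <- names_var

# Label every categorized column in place
for (var in names_var) {
  set(data, j = var,
      value = factor(data[[var]], levels = lvs[[var]], labels = lbs[[var]]))
}

# Hypothetical helper: map a labeled column back to its original numbers
# by indexing the stored level values with the factor's integer codes
original_values <- function(dt, var, lvs) {
  lvs[[var]][as.integer(dt[[var]])]
}

# The dummy variable displays "Yes"/"No" but can still be summed as 1/0
sum(original_values(data, "smoker", lvs))
```

Keeping the mapping in a named list outside the table, rather than in a custom attribute on each column, is a deliberate choice in this sketch: base R factor subsetting does not carry extra attributes along, so an external lookup is less likely to be lost silently.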

PS: I have read the R documentation and the answers to these questions:

Factor, levels, and original values

Having issues using order function in R

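Unfortunately, I could not figure it out on my own, and it is clear that using the labels argument of factor() is not the way to accomplish this.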

Thank you very much!

How to label a factor but still keep the values of its original levels - R
