Working with value labels in R with the sjlabelled package

Time: 2018-10-12 11:03:32

Tags: r

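I recently switched from Stata to R. Stata has something called value labels. For example, with the encode command you can convert a string variable to a numeric one and attach a string label to each number. Because string variables contain names (which mostly repeat), using value labels saves a lot of space when working with large datasets. Unfortunately, I have not managed to find a similar command in R. The only package I found that can attach labels to my vector of values is sjlabelled. It does attach them, but when I try to merge the labelled numeric vector into another data frame, the labels seem to "fall off".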

For example, say we take a paragraph from Wikipedia, just to have a string variable.

paragraph <- "Melanija Knavs was born in Novo Mesto, and grew up in Sevnica, in the Yugoslav republic of Slovenia. She worked as a fashion model through agencies in Milan and Paris, later moving to New York City in 1996. Her modeling career was associated with Irene Marie Models and Trump Model Management"
# install.packages("sjlabelled")  # run once if the package is not installed
library(sjlabelled)
# Split the paragraph on spaces and flatten the result
sentences <- strsplit(paragraph, " ")
sentences <- unlist(sentences, use.names = FALSE)
# Now we have a vector of string values.
sentences_df <- data.frame(sentences, stringsAsFactors = FALSE)
# Lookup table: one row per unique string plus an integer code
z <- data.frame(sentences = unique(sentences_df$sentences), stringsAsFactors = FALSE)
z$group_sentences <- seq_along(z$sentences)
# Attach each string as the value label of its integer code
z$group_sentences <- set_labels(z$group_sentences, labels = z$sentences)
sentences_df <- merge(sentences_df, z, by = "sentences")
get_labels(z$group_sentences)             # the labels I attached using set_labels
get_labels(sentences_df$group_sentences)  # the output is just NULL

Thanks! P.S. Sorry, as I said before, I am very new to R.

2 answers:

Answer 0 (score: 0)

Source: https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/

  

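... In around June of 2007, R introduced hashing of CHARSXP elements in the underlying C code, thanks to Seth Falcon. What this meant was that, effectively, character strings were hashed to an integer representation and stored in a global table in R. Anywhere a character string was needed in R, it could be referenced by its underlying integer. This effectively put in place, globally, the factor encoding behavior of strings from before. Once this was implemented, there was little to be gained from an efficiency standpoint by encoding character variables as factors. Of course, you still needed to use "factors" for the modeling functions. ...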
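For what it is worth, the closest base R counterpart to Stata's encode is probably `factor()`. The sketch below is only an illustration with made-up data (the `words` vector is hypothetical, not from the question): it shows that a factor stores one integer code per element plus a small table of distinct strings, and that the original strings can be rebuilt from the two.

```r
# Hypothetical example data: a character vector with repeated names.
words <- c("Milan", "Paris", "Milan", "New York", "Paris", "Milan")

# factor() assigns one integer code per distinct value and keeps the
# distinct strings as levels, roughly what Stata's encode produces.
f <- factor(words)

codes  <- as.integer(f)  # integer code for each element
labels <- levels(f)      # distinct strings, sorted

# The original strings can be recovered from the codes and the levels.
recovered <- labels[codes]
```

As the quote above suggests, this is mainly useful for modeling functions and for carrying a code-to-label mapping around, not for saving memory.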

Answer 1 (score: 0)

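I tweaked your initial test data a little. I was confused by so many strings and was not sure they were necessary for the question. Let me know if I missed the point. Here are my adjustments and my answer: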

#####################################
# initial problem rephrased
#####################################

# create test data
library(sjlabelled)  # provides set_labels() and get_labels()
id <- 1:20
variable1 <- sample(30:35, 20, replace = TRUE)
variable2 <- sample(36:40, 20, replace = TRUE)
df1 <- data.frame(id, variable1)
df2 <- data.frame(id, variable2)

# set arbitrary labels
df1$variable1 <- set_labels(df1$variable1, labels = c("few" = 1, "lots" = 5))

# show labels in this frame
get_labels(df1)

# include associated values
get_labels(df1, values = "as.prefix")

# merge df1 and df2
df_merge <- merge(df1, df2, by = c('id'))

# labels lost after merge
get_labels(df_merge, values = "as.prefix")

#####################################
# solution with dplyr 
#####################################
library(dplyr)
df_merge2 <- left_join(x = df1, y = df2, by = "id")
get_labels(df_merge2, values = "as.prefix")

Solution credited to:

Merging and keeping variable labels in R
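
If adding dplyr is not an option, a base R workaround might be to copy the label attributes back onto the merged columns. This is only a minimal sketch under one assumption: that the value labels live in an attribute named `labels`, which is how haven-style labelled vectors store them (the attribute is set by hand here so the example does not depend on sjlabelled). The small data frames are hypothetical stand-ins for the test data.

```r
# Hypothetical test data, a smaller stand-in for df1 and df2.
df1 <- data.frame(id = 1:5, variable1 = c(1, 5, 1, 5, 1))
df2 <- data.frame(id = 1:5, variable2 = 36:40)

# Assumption: value labels are kept in a "labels" attribute.
attr(df1$variable1, "labels") <- c(few = 1, lots = 5)

merged <- merge(df1, df2, by = "id")

# merge() rebuilds the columns by row indexing, and indexing a plain
# vector drops custom attributes, so the labels are gone at this point.
labels_after_merge <- attr(merged$variable1, "labels")

# Copy the attribute back for every column that came from df1.
for (col in intersect(names(df1), names(merged))) {
  attr(merged[[col]], "labels") <- attr(df1[[col]], "labels")
}
```

The same loop would need to be repeated for df2 if it carried labels too, and any later row subsetting in base R would drop the attribute again.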