Generate a new factor variable based on the values of other factors in each row

Time: 2015-08-31 23:19:56

Tags: r

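I am trying to write a function that creates a new variable whose value depends on a condition. I have a survey dataset with 100+ columns, which will be collapsed accordingly. I read this, but it did not help.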

'data.frame':   117 obs. of  7 variables:
 $ fin_partner: Factor w/ 4 levels "","9","No","Yes": 2 2 4 3 2 2 2 2 4 4 ...
 $ fin_parent : Factor w/ 4 levels "","9","No","Yes": 2 2 2 2 2 2 4 3 2 2 ...
 $ fin_kids   : Factor w/ 4 levels "","9","No","Yes": 4 2 2 2 2 2 2 2 2 2 ...
 $ fin_othkids: Factor w/ 4 levels "","9","No","Yes": 2 2 2 2 2 2 3 2 2 2 ...
 $ fin_fam    : Factor w/ 4 levels "","9","No","Yes": 2 2 2 2 2 2 4 3 2 2 ...
 $ fin_friend : Factor w/ 4 levels "","9","No","Yes": 2 2 3 3 2 2 2 2 4 2 ...
 $ fin_oth    : Factor w/ 4 levels "","9","No","Yes": 2 2 2 2 2 2 2 2 4 2 ...

I would like to be able to subset the dataset by column and then pass the subset to the function. Right now, the values are "Yes", "No", and "999" (missing).

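My goal is to determine, for each row, whether any column contains "Yes"; if so, the new column should be filled with "Yes". I am sure there is a simpler way than the code below, so I am open to suggestions.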

My code so far:

trial <- df[, 23:29]
trial.test <- as.data.frame(trial)

composite_score <- function(x){
  # Convert to numeric values
  change_to_number <- function(j) {
    for (i in 1:length(j)){
      if(i == "Yes"){
        i <- 1
      }
      else{
        i <- 0
      }
    }
  }

  x <- change_to_number(x)  

  new_col_var <- function(k){
    if(rowSums(k) > 0){
      k$newvar <- 1
    }
    else {
      k$newvar <- 0
    }
  }

  x <- new_col_var(x)

}

composite_score(trial.test)

The code produces the following error:

Error in rowSums(k) : 'x' must be an array of at least two dimensions 

Data:

> dput(head(trial.test))
structure(list(fin_partner = structure(c(2L, 2L, 4L, 3L, 2L, 
2L), .Label = c("", "9", "No", "Yes"), class = "factor"), fin_parent = structure(c(2L, 
2L, 2L, 2L, 2L, 2L), .Label = c("", "9", "No", "Yes"), class = "factor"), 
    fin_kids = structure(c(4L, 2L, 2L, 2L, 2L, 2L), .Label = c("", 
    "9", "No", "Yes"), class = "factor"), fin_othkids = structure(c(2L, 
    2L, 2L, 2L, 2L, 2L), .Label = c("", "9", "No", "Yes"), class = "factor"), 
    fin_fam = structure(c(2L, 2L, 2L, 2L, 2L, 2L), .Label = c("", 
    "9", "No", "Yes"), class = "factor"), fin_friend = structure(c(2L, 
    2L, 3L, 3L, 2L, 2L), .Label = c("", "9", "No", "Yes"), class = "factor"), 
    fin_oth = structure(c(2L, 2L, 2L, 2L, 2L, 2L), .Label = c("", 
    "9", "No", "Yes"), class = "factor")), .Names = c("fin_partner", 
"fin_parent", "fin_kids", "fin_othkids", "fin_fam", "fin_friend", 
"fin_oth"), row.names = c(NA, 6L), class = "data.frame")

3 answers:

Answer 0: (score: 1)

Your change_to_number function is badly broken - it only sets i to 1 or 0, which has no effect on the input. You could change it to:

change_to_number <- function(j){
    # "Yes" must match the factor level exactly (comparison is case-sensitive)
    sapply(j, function(x) +(x == "Yes"))
}
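
To see why the original version ends in the rowSums error: a for loop in R evaluates to NULL, so the original change_to_number returned NULL, and rowSums(NULL) then complains that 'x' is not an array of at least two dimensions. The sketch below runs the corrected helper on a small stand-in data frame. The name toy and the choice of three columns are assumptions made for illustration; the values are copied from the dput output in the question.

```r
# Stand-in for head(trial.test), restricted to three of the survey columns
lv <- c("", "9", "No", "Yes")
toy <- data.frame(
  fin_partner = factor(c("9", "9", "Yes", "No", "9", "9"), levels = lv),
  fin_kids    = factor(c("Yes", "9", "9", "9", "9", "9"), levels = lv),
  fin_friend  = factor(c("9", "9", "No", "No", "9", "9"), levels = lv)
)

change_to_number <- function(j) {
  # Column by column: TRUE/FALSE for "Yes", then unary + converts to 1/0
  sapply(j, function(x) +(x == "Yes"))
}

indicators <- change_to_number(toy)   # 6 x 3 matrix of 0/1 values
newvar <- +(rowSums(indicators) > 0)  # 1 if any column in the row is "Yes"
newvar
```

Because sapply returns a matrix here, rowSums receives a two-dimensional object and the error from the question no longer occurs.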

Alternatively, change the whole function to:

composite_score <- function(x){
    # 1 if the row contains "Yes" in any column, 0 otherwise
    +(apply(x, 1, function(z) ("Yes" %in% z)))
}

Then run your function:

dat$newcol <- composite_score(dat)

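Explanation: you want to know whether there is a "Yes" in each row. To check, you could run the following for each row (as.matrix converts the factor columns to character first, which is also what apply does with a data frame):

"Yes" %in% as.matrix(trial.test)[1, ]
"Yes" %in% as.matrix(trial.test)[2, ]
# ...and so on for the remaining rows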

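To do this for every row at once, you can use apply as shown below - we apply the function "Yes" %in% z across the rows (1), with each row passed to the function as z:

tempdata <- apply(trial.test, 1, function(z) ("Yes" %in% z))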
tempdata

You should get TRUE or FALSE for each row. Now we can use a trick: R will convert TRUE to 1 and FALSE to 0:

as.numeric(tempdata)
+(tempdata) #same, less typing

Putting it all together gives you the new column:

+(apply(trial.test, 1, function(z) ("Yes" %in% z)))
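
As a possible variation on the same idea (a sketch, not part of the answer above), the row-wise check can also be written without apply: comparing a data frame to "Yes" yields a logical matrix, and rowSums counts the TRUE values per row. The names toy, is_yes, any_yes and newvar are made up for illustration, and na.rm = TRUE reflects the assumption that a missing value should not count as "Yes".

```r
# Stand-in for head(trial.test), restricted to three of the survey columns
lv <- c("", "9", "No", "Yes")
toy <- data.frame(
  fin_partner = factor(c("9", "9", "Yes", "No", "9", "9"), levels = lv),
  fin_kids    = factor(c("Yes", "9", "9", "9", "9", "9"), levels = lv),
  fin_friend  = factor(c("9", "9", "No", "No", "9", "9"), levels = lv)
)

is_yes  <- toy == "Yes"                               # logical matrix, same shape as toy
any_yes <- unname(rowSums(is_yes, na.rm = TRUE) > 0)  # TRUE if the row has any "Yes"

# Store the result as a "No"/"Yes" factor, as the question asks for
toy$newvar <- factor(ifelse(any_yes, "Yes", "No"), levels = c("No", "Yes"))
toy
```

Fixing the levels explicitly keeps the new column comparable across subsets, even if one subset happens to contain no "Yes" at all.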

Answer 1: (score: 1)

Thanks for posting the data, it makes it possible to actually check what I write!

# Loading your data
trial.test <- structure(list(fin_partner = [... redacted ...], class = "data.frame")

# computing the new variable
# the MARGIN=1 arg specifies that we are working on the rows
# the applied function just looks for a "Yes" in the row
# and returns "Yes" if... yes, "No" otherwise.
myvar <- apply(trial.test, MARGIN=1, FUN=function(row) 
    ifelse(any("Yes" %in% row), "Yes", "No"))

# converting it to factor
myvar <- factor(myvar)

# putting it in trial.test just for illustration
cbind(trial.test, summary=myvar)

This gives:

  fin_partner fin_parent fin_kids fin_othkids fin_fam fin_friend fin_oth summary
1           9          9      Yes           9       9          9       9     Yes
2           9          9        9           9       9          9       9      No
3         Yes          9        9           9       9         No       9     Yes
4          No          9        9           9       9         No       9      No
5           9          9        9           9       9          9       9      No
6           9          9        9           9       9          9       9      No
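
Since the question mentions 100+ columns that are collapsed group by group, the apply call above could be wrapped in a small helper that takes the column names of one group. This is only a sketch: collapse_yes, toy, fin_cols and fin_any are hypothetical names, and selecting a group by the "fin_" prefix is an assumption about how the survey columns are named.

```r
# Hypothetical helper (not from the answer above): returns a "No"/"Yes" factor
# that is "Yes" when any of the selected columns is "Yes" in that row.
collapse_yes <- function(data, cols) {
  rows <- as.matrix(data[, cols, drop = FALSE])  # factor columns become character
  hit  <- apply(rows, MARGIN = 1, FUN = function(row) "Yes" %in% row)
  factor(ifelse(hit, "Yes", "No"), levels = c("No", "Yes"))
}

# Small made-up data frame with an id column and one group of survey columns
toy <- data.frame(
  id          = 1:4,
  fin_partner = factor(c("9", "9", "Yes", "No")),
  fin_kids    = factor(c("Yes", "9", "9", "9"))
)

fin_cols <- grep("^fin_", names(toy), value = TRUE)  # pick one group by prefix
toy$fin_any <- collapse_yes(toy, fin_cols)
toy
```

Passing column names rather than positions such as 23:29 keeps the call valid if columns are later added or reordered.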

Answer 2: (score: 0)

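library(tidyr)
library(dplyr)
library(magrittr)

# Add a row identifier so the result can be joined back to the original rows
trial.test %<>% mutate(row_number = 1:n())

answer = 
  trial.test %>%
  # Reshape to long format: one line per (row_number, variable) pair
  gather(variable, value, -row_number) %>%
  # Keep only the "Yes" responses
  filter(value == "Yes") %>%
  select(-variable) %>%
  # One line per original row that has at least one "Yes"
  distinct %>%
  # Join back: value is "Yes" where one was found, NA otherwise
  right_join(trial.test, by = "row_number")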