How do I apply a function to each row and get rows back in R?

Time: 2015-10-30 20:14:42

Tags: r

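I have the following data frame, and I would like to replace the NAs in each row with the first product_ids from top_products that do not already appear in that row. To give some context, these are product recommendations.

Although I have some experience with plyr and sapply, I am struggling to find the right way to achieve this.

I think the code below speaks for itself.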

> head(recs_with_na)
      V1   V2   V3   V4
148 1227 1213 <NA> <NA>
249 1169 1221 <NA> <NA>
553 1227 1162 <NA> <NA>
732 1227 1162 <NA> <NA>
765 1227 1162 <NA> <NA>
776 1227 1162 <NA> <NA>
> top_products
   product_id count
21       1162  7917
65       1213  4839
19       1160  4799
11       1152  3543
34       1175  3423
75       1227  2719
2        1143  2396
13       1154  2168
> fill_nas_with_top <- function(data, top_products) {
+   top_products_copy <- top_products
+   mydata <- data
+   #mydata <- as.data.frame(data)
+   for (i in 1:4) {
+     if (is.na(mydata[,i])) {
+       mydata[,i] <- top_products_copy[1,1]
+       top_products_copy <- top_products_copy[-1,]      
+       
+     }
+     else {
+       top_products_copy <- top_products_copy[top_products_copy[,1] != mydata[,i],]
+     }
+   }  
+   return(mydata)
+ }
> sapply(recs_with_na, fill_nas_with_top, top_products)
 Error in `[.default`(mydata, , i) : incorrect number of dimensions 

1 answer:

Answer 0: (score: 1)

R uses pass-by-value semantics. Your function gets its own copies of data and top_products each time it is called, so there is no need for you to make defensive copies.
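A quick way to convince yourself of this is the small sketch below. The names `add_one`, `x` and `y` are made up for the illustration and are not part of the question's code. The function changes its argument, yet the caller's vector stays as it was.

```r
# Hypothetical helper: modifies its argument and returns the modified copy.
add_one <- function(v) {
    v[1] <- v[1] + 1   # changes only the function's local copy
    v
}

x <- c(10, 20, 30)
y <- add_one(x)

print(x)  # the caller's vector is unchanged
print(y)  # the returned copy carries the modification
```

The same holds for a data frame passed as an argument, which is why the `top_products_copy` and `mydata` assignments in the question are not needed.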

Because pass-by-value means creating copies (and for many other reasons too), it is good practice to give your functions the smallest amount of information they need to accomplish their task. In this case, you don't need to pass the whole top_products data frame. A vector of product_ids will do.

fill_nas_with_top <- function(data, top) {
    ## data: one row of recommendations; top: product_ids, most popular first
    for (i in seq_along(data)) {
        d <- data[i]
        if (is.na(d)) {
            ## Find the first top product that is not already in the row
            for (t in top) {
                top <- top[-1]   # t is consumed whether or not it gets used
                if (!(t %in% data)) {
                    data[i] <- t
                    break
                }
            }
        } else {
            ## This no longer assumes that product_ids in top are ordered as in data
            if (d %in% top) top <- top[-which(d == top)]
        }
    }
    return(data)
}

Called like this (observe that we call it with a vector of product_ids in top_products):

as.data.frame(t(apply(recs_with_na, 1, fill_nas_with_top, top_products[,1])))

will produce:

    V1   V2   V3   V4
1 1227 1213 1162 1160
2 1169 1221 1162 1213
3 1227 1162 1213 1160
4 1227 1162 1213 1160
5 1227 1162 1213 1160
6 1227 1162 1213 1160
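
For completeness, here is a self-contained sketch that can be pasted into a fresh session. The two data frames are rebuilt by hand from the printout in the question (first three rows only), and storing the recommendations as character columns is my assumption, since the `<NA>` in the printout suggests character or factor columns. It also shows why the original call failed: `sapply` on a data frame walks over its columns, so the function received a plain vector and `mydata[,i]` had no second dimension to index. `apply` with `MARGIN = 1` passes one row at a time instead, and because it returns the per-row results as columns, `t()` is needed to turn them back into rows.

```r
# Data rebuilt by hand from the question's printout
# (assumption: the recommendation columns are character).
recs_with_na <- data.frame(
    V1 = c("1227", "1169", "1227"),
    V2 = c("1213", "1221", "1162"),
    V3 = NA_character_,
    V4 = NA_character_,
    row.names = c("148", "249", "553"),
    stringsAsFactors = FALSE
)
top_products <- data.frame(
    product_id = c(1162, 1213, 1160, 1152, 1175, 1227, 1143, 1154),
    count      = c(7917, 4839, 4799, 3543, 3423, 2719, 2396, 2168)
)

# Same logic as the function in the answer.
fill_nas_with_top <- function(data, top) {
    for (i in seq_along(data)) {
        d <- data[i]
        if (is.na(d)) {
            for (t in top) {
                top <- top[-1]
                if (!(t %in% data)) {
                    data[i] <- t
                    break
                }
            }
        } else if (d %in% top) {
            top <- top[-which(d == top)]
        }
    }
    data
}

# sapply() iterates over columns: each call sees a vector of length nrow.
col_lengths <- sapply(recs_with_na, length)

# apply(..., 1, ...) iterates over rows; the results come back as columns,
# so t() restores the row orientation.
filled <- as.data.frame(
    t(apply(recs_with_na, 1, fill_nas_with_top, top_products$product_id)),
    stringsAsFactors = FALSE
)

print(col_lengths)
print(filled)
```

Note that `apply` converts the data frame to a matrix first, so every value in the result is a character string. If numeric columns are needed, convert them afterwards.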