Finding unique values within a row of a data frame

Time: 2012-09-27 14:53:39

Tags: r

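I have two data frames containing the serial numbers of products that people have bought, ordered by number of purchases. The first column is custId (named id in the data below), and the next 5 columns are serial numbers, ordered from left to right by the number of items purchased.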

DF1

  id col1 col2 col3 col4 col5
1  1 4742  927 7889   NA   NA
2  2 4964 9295 9174  228 9470
3  3 5834 7758   NA   NA   NA
4  4 2802 9984  323   NA   NA
5  5  179  198 3996 6801 7561
6  6 7755 1252 9684 9940   NA

DF2

  id col6 col7 col8 col9 col10
1  1 1816 6686   NA   NA    NA
2  2 6141 9728 6981 3089  5674
3  3 5659 3931 5022 4361  9264
4  4 3210 2488 9939 7543  7757
5  5 9213 1372 4374 7962  4983
6  6 3451 5646 6069   NA    NA

I am trying to combine them into a single set of 5 serial numbers per customer, like this:

  id col1 col2 col3 col4 col5
1  1 4742  927 7889 1816 6686   
2  2 4964 9295 9174  228 9470
3  3 5834 7758 5022 4361 9264
4  4 2802 9984  323 7543 7757
5  5  179  198 3996 6801 7561
6  6 7755 1252 9684 9940 3451 

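I have a couple of questions.

1) How do I find the unique values within a row?

2) How do I preserve the left-to-right order across the whole row?

Any suggestions?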

> dput(df1)
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), col1 = c(4742, 
4964, 5834, 2802, 179, 7755, 6467, 8671, 2910, 150), col2 = c(927, 
9295, 7758, 9984, 198, 1252, 1664, 5242, 6995, 3875), col3 = c(7889, 
9174, NA, 323, 3996, 9684, 1150, 2973, 9948, 8598), col4 = c(NA, 
228, NA, NA, 6801, 9940, 854, 4744, 4006, 3196), col5 = c(NA, 
9470, NA, NA, 7561, NA, 4342, 1791, 286, 7425)), .Names = c("id", 
"col1", "col2", "col3", "col4", "col5"), row.names = c(NA, -10L
), class = "data.frame")
> dput(df2)
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), col6 = c(1816, 
6141, 5659, 3210, 9213, 3451, 2440, 5706, 5281, 7110), col7 = c(6686, 
9728, 3931, 2488, 1372, 5646, 2641, 7851, 5581, 5775), col8 = c(NA, 
6981, 5022, 9939, 4374, 6069, 7525, 4927, 9767, 1331), col9 = c(NA, 
3089, 4361, 7543, 7962, NA, 7526, 4215, 9923, 9887), col10 = c(NA, 
5674, 9264, 7757, 4983, NA, 9996, 5886, 9546, 9419)), .Names = c("id", 
"col6", "col7", "col8", "col9", "col10"), row.names = c(NA, -10L
), class = "data.frame")

2 answers:

Answer 0 (score: 2)

This works:

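# Put both data frames side by side, dropping the duplicate id column of df2
df3 <- cbind(df1, df2[, -1])

# Shift the non-NA values of row x to the left and pad the right with NA
subs <- function(x){
  temp <- df3[x,][!is.na(df3[x,])]
  temp2 <- ncol(df3) - length(temp)
  temp <- c(temp, rep(NA, temp2))
  df3[x,] <<- temp   # note: modifies the global df3
}

for(i in seq_len(nrow(df3))){
  subs(i)
}

# Keep id plus the first five serial numbers
final.df <- df3[, 1:6]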

> final.df
   id col1 col2 col3 col4 col5
1   1 4742  927 7889 1816 6686
2   2 4964 9295 9174  228 9470
3   3 5834 7758 5659 3931 5022
4   4 2802 9984  323 3210 2488
5   5  179  198 3996 6801 7561
6   6 7755 1252 9684 9940 3451
7   7 6467 1664 1150  854 4342
8   8 8671 5242 2973 4744 1791
9   9 2910 6995 9948 4006  286
10 10  150 3875 8598 3196 7425
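
The loop above modifies df3 through `<<-` and only removes NAs, so a serial number that appeared in both data frames would be kept twice. As a rough sketch of one way around both points, the same left-shift can be written as a plain function applied to each row. The helper `compact_row` and the toy data frames `a` and `b` are hypothetical names chosen for illustration, not part of the original answer.

```r
# Toy inputs (hypothetical): row 2 repeats serial 20 across the two frames
a <- data.frame(id = 1:2, col1 = c(10, 20), col2 = c(NA, 30))
b <- data.frame(id = 1:2, col3 = c(11, 20), col4 = c(12, 40))

# Drop NAs, keep first occurrences in left-to-right order,
# then truncate or NA-pad the result to exactly n values
compact_row <- function(v, n) {
  v <- unique(v[!is.na(v)])
  length(v) <- n
  v
}

# Work on a numeric matrix of the serial-number columns only
wide <- as.matrix(cbind(a[, -1], b[, -1]))
out <- t(apply(wide, 1, compact_row, n = ncol(a) - 1))

result <- data.frame(id = a$id, out)
names(result)[-1] <- names(a)[-1]
result
```

`unique()` keeps the first occurrence of each value, so the left-to-right order of the row is preserved without any extra sorting step.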

Answer 1 (score: 1)

I think this will work:

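# Combine the serial-number columns of both data frames (id dropped)
x <- cbind(df1[, -1], df2[, -1])
# Keep the first occurrence of each value, preserving order
dups <- function(x) x[!duplicated(x)]
# Per row: drop NAs, drop repeats, keep the first five values
new.df <- data.frame(df1[, 1, drop=FALSE],
    t(apply(x, 1, function(x) dups(na.omit(x))[1:5])))

# Reuse the col1..col5 names from df1
colnames(new.df)[-1] <- colnames(df1[, -1])
new.df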

Which yields:

   id col1 col2 col3 col4 col5
1   1 4742  927 7889 1816 6686
2   2 4964 9295 9174  228 9470
3   3 5834 7758 5659 3931 5022
4   4 2802 9984  323 3210 2488
5   5  179  198 3996 6801 7561
6   6 7755 1252 9684 9940 3451
7   7 6467 1664 1150  854 4342
8   8 8671 5242 2973 4744 1791
9   9 2910 6995 9948 4006  286
10 10  150 3875 8598 3196 7425
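
Both answers use cbind, which assumes that the rows of df1 and df2 are in the same id order. If that cannot be guaranteed, one option is to join on id first and then compact each row. The following is a minimal sketch under that assumption; the toy frames `p` and `q` and the helper `first_unique` are hypothetical and only serve to illustrate the idea.

```r
# Toy inputs (hypothetical): q lists the same customers in a different order
p <- data.frame(id = c(1, 2, 3), col1 = c(5, 6, 7), col2 = c(NA, 8, NA))
q <- data.frame(id = c(3, 1, 2), col3 = c(9, 5, 4), col4 = c(2, 1, NA))

# Join on id so each customer's serials line up regardless of row order
m <- merge(p, q, by = "id")

# Per row: drop NAs and repeats, then take the first n values in order
# (indexing past the end pads with NA)
first_unique <- function(v, n) unique(v[!is.na(v)])[seq_len(n)]

vals <- t(apply(as.matrix(m[, -1]), 1, first_unique, n = 2))
res <- data.frame(id = m$id, col1 = vals[, 1], col2 = vals[, 2])
res
```

For customer 1 the combined row is 5, NA, 5, 1, so the repeated 5 is dropped and the result is 5, 1.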