Removing duplicate values from a table using R

Time: 2015-07-16 05:37:12

Tags: r duplicates

verbdata_bkp1[1:5,2:4]
                               V2                      V3             V4
1.content Document Not Received~2 Document not received~2           <NA>
2.content          Payment Ease~1                    QR~1           <NA>
3.content       Payment Receipt~2       Payment Receipt~2 Payment Ease~1
4.content             Surrender~1       Product Returns~1           <NA>
5.content                    <NA>                    <NA>           <NA>

So in row 1 we have "Document not received~2" twice, and in row 3 we have "Payment Receipt~2" twice. Each of these should appear only once per row.

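1 Answer:

Answer 0: (score: 0)

One option is to loop over the rows, convert the elements to upper or lower case, check for duplicates with duplicated(), and change the duplicated values to NA.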

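 # duplicated(toupper(x)) flags case-insensitive repeats within a row.
 # NA^TRUE is NA and NA^FALSE is 1, so the index is NA at each repeat.
 df1[-1] <- t(apply(df1[-1], 1, function(x)
                 x[NA^duplicated(toupper(x))*seq_along(x)]))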
 df1
  #        V1                      V2                V3             V4
  #1 1.content Document Not Received~2              <NA>           <NA>
  #2 2.content          Payment Ease~1              QR~1           <NA>
  #3 3.content       Payment Receipt~2              <NA> Payment Ease~1
  #4 4.content             Surrender~1 Product Returns~1           <NA>
  #5 5.content                    <NA>              <NA>           <NA>

Note: I did not use the values in the first column, as it appears to be an identifier column.

Data
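The same idea can also be spelled out step by step, which may be easier to read than the NA^duplicated() index trick. The following is only a sketch under some assumptions: the helper name dedupe_row and the small demo data frame are hypothetical and not part of the original answer, and the missing cells are assumed to be real NA values rather than the string "<NA>".

```r
# Hypothetical helper (not from the original answer): within one row,
# keep the first occurrence of each value (ignoring case) and set
# later repeats to NA. Cells that are already NA are left untouched.
dedupe_row <- function(x) {
  x <- as.character(x)
  dup <- duplicated(toupper(x)) & !is.na(x)
  x[dup] <- NA_character_
  x
}

# Illustrative data only (assumed): mirrors rows 1 and 3 of the question,
# with real NA values in place of the printed <NA>.
demo <- data.frame(
  V1 = c("1.content", "3.content"),
  V2 = c("Document Not Received~2", "Payment Receipt~2"),
  V3 = c("Document not received~2", "Payment Receipt~2"),
  V4 = c(NA, "Payment Ease~1"),
  stringsAsFactors = FALSE
)

# apply() over rows returns one column per input row, so transpose back.
# The identifier column V1 is excluded with [-1], as in the answer.
demo[-1] <- t(apply(demo[-1], 1, dedupe_row))
demo
```

With this demo data, the repeated value in V3 should be replaced by NA in both rows, while V2 and V4 stay as they were.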

 df1 <- structure(list(V1 = c("1.content", "2.content", "3.content", 
 "4.content", "5.content"), V2 = c("Document Not Received~2", 
 "Payment Ease~1", "Payment Receipt~2", "Surrender~1", "<NA>"), 
V3 = c("Document not received~2", "QR~1", "Payment Receipt~2", 
"Product Returns~1", "<NA>"), V4 = c("<NA>", "<NA>", "Payment Ease~1", 
"<NA>", "<NA>")), .Names = c("V1", "V2", "V3", "V4"),
 class = "data.frame", row.names = c(NA, -5L))