How can I remove duplicate rows that contain less data in R?

Time: 2015-03-30 19:43:57

Tags: r data.table

Say I have the following data table (data):

row,or,d,ddate,rdate,changes,class,price,fdate,company,number,minutes,added,source
1,VA1,VA2,2014-05-24,,0,0,2124,2014-05-22 15:50:16,,,,2014-05-22 12:20:03,tp
2,VA1,VA2,2014-05-26,,0,0,2124,2014-05-22 15:03:44,,,,2014-05-22 12:20:03,tp
3,VA1,VA2,2014-05-26,,0,0,2124,2014-05-22 15:03:44,A1,,,2014-05-22 12:20:03,tp
4,VA1,VA2,2014-06-05,,0,0,2124,2014-05-22 15:48:24,,,,2014-05-22 12:20:03,tp
5,VA1,VA2,2014-06-09,,0,0,2124,2014-05-22 15:37:35,,,,2014-05-22 12:20:03,tp
6,VA1,VA2,2014-06-16,,0,0,2124,2014-05-22 14:17:33,,,,2014-05-22 12:20:03,tp
7,VA1,VA2,2014-06-16,,0,0,2124,2014-05-22 14:17:33,,,,2014-05-22 12:20:03,tp

I want to remove the duplicate rows. If I run data <- unique(data, by = NULL), only the last row (row 7) is removed, but I also want to remove row 2. I can define a key with setkey():

setkey(data, row,or,d,ddate,rdate,changes,class,price,fdate,number,minutes,added,source)

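With that key, either row 2 or row 3 is removed. But I want to remove the rows that contain less data and keep the rows that contain more. In the case above, row 2 should be removed and row 3 should be kept, because it has an additional value in the company column. How can I do this?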
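The behaviour described above can be reproduced on a small scale. The following is only a sketch: `toy` and its `id` column are hypothetical stand-ins for rows 2, 3, 6 and 7 of the table, reduced to the two columns that matter here.

```r
library(data.table)

# Hypothetical cut-down version of rows 2, 3, 6 and 7 of the table above.
toy <- data.table(
  id      = c(2L, 3L, 6L, 7L),
  ddate   = c("2014-05-26", "2014-05-26", "2014-06-16", "2014-06-16"),
  company = c("", "A1", "", "")
)

# Comparing ddate and company: only the exact duplicate (id 7) is dropped.
all_cols <- unique(toy, by = c("ddate", "company"))$id

# Comparing ddate only: the first row of each group is kept, so id 3
# (the row that carries the company value) is the one that gets lost.
key_only <- unique(toy, by = "ddate")$id
```

Because `unique()` keeps the first row of each group, the outcome depends on row order and not on how much data a row holds, which is the problem the question is about.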

1 Answer:

Answer 0: (score: 0)

How about this:

library(data.table)

# whatever the important columns are for your uniqueness criterion
important.cols = c('or','d','ddate','rdate','changes','class','price','fdate')

# within each group, keep the row with the most non-empty elements
# (NA cells are skipped via na.rm; on a tie the first row is kept)
data[, .SD[which.max(rowSums(.SD != "", na.rm = TRUE))], by = important.cols]
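To try the idea end to end, here is a minimal self-contained sketch. It assumes that `row` is a real column, that empty cells are stored as `""`, and it uses only a subset of the question's columns. The helper column `filled` is a name introduced for illustration, and the `.I` lookup is a variant of the `.SD[which.max(...)]` call above, not the answer's exact code.

```r
library(data.table)

# Toy data modelled on rows 1-3 and 6-7 of the question (renumbered 1-5).
data <- data.table(
  row     = 1:5,
  or      = "VA1",
  d       = "VA2",
  ddate   = c("2014-05-24", "2014-05-26", "2014-05-26",
              "2014-06-16", "2014-06-16"),
  price   = 2124,
  fdate   = c("2014-05-22 15:50:16", "2014-05-22 15:03:44",
              "2014-05-22 15:03:44", "2014-05-22 14:17:33",
              "2014-05-22 14:17:33"),
  company = c("", "", "A1", "", ""),
  source  = "tp"
)

# Columns that define a duplicate (assumed subset of the answer's list).
important.cols <- c("or", "d", "ddate", "price", "fdate")

# Number of non-empty cells per row; NA cells are skipped via na.rm.
data[, filled := rowSums(.SD != "", na.rm = TRUE)]

# Row index of the fullest row in each group (first one wins on a tie).
idx <- data[, .I[which.max(filled)], by = important.cols]$V1

result <- data[idx]
result[, filled := NULL]  # drop the helper column again
```

With this toy input, `result` keeps the rows numbered 1, 3 and 4: the row with the `company` value wins over its emptier twin, and of the two identical rows the first one is kept.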