Question

I have a data frame in which 'Earning' is numeric and A, B, C, D, E, ... are binary vectors.

Earning A B C D E ... (1000 such binary vector columns)
  21    1 0 0 1 1
  45    0 0 0 1 1
  67    0 0 0 1 1
  23    0 0 0 0 1
  44    0 0 0 1 1
  77    1 1 0 0 1
  89    0 1 0 1 1
  90    1 0 0 0 0

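Of the 1000 columns A, B, C, ..., I want to keep the 400 with the largest colSums. The remaining 600 should be collapsed into a single column labeled "other" that holds 0 or 1: each row's entry in "other" is the logical OR across the 600 columns with the smallest colSums.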

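Overall, the goal is to regress Earning on the 400 most popular columns among A, B, C, D, E, ... (where popularity is measured by the number of 1s in the binary vector).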

Answer 1

Assume dfs is the data.frame containing the data.

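# Drop column 1 ('Earning') before ranking, then add 1 to map back to positions in dfs
new_order <- order(colSums(dfs[, -1], na.rm = TRUE), decreasing = TRUE) + 1
res <- cbind(
    dfs[, c(1, new_order[1:400])],
    # 'other' is 1 when any of the remaining columns has a 1 in that row
    other = 1 * (rowSums(dfs[, new_order[-(1:400)], drop = FALSE], na.rm = TRUE) > 0)
)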

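The resulting data.frame res has Earning first, then the 400 retained columns in decreasing order of colSums, and the other column last.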
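
To sanity-check the approach, here is a minimal sketch on the eight sample rows from the question. It uses only columns A to E, and n_keep is a hypothetical stand-in for 400 (set to 3 here so that two columns fall into "other"). The final lm call is one assumed way to run the regression the question describes, not something the answer itself specifies.

```r
# Sample rows from the question; only columns A-E are shown there.
dfs <- data.frame(
  Earning = c(21, 45, 67, 23, 44, 77, 89, 90),
  A = c(1, 0, 0, 0, 0, 1, 0, 1),
  B = c(0, 0, 0, 0, 0, 1, 1, 0),
  C = c(0, 0, 0, 0, 0, 0, 0, 0),
  D = c(1, 1, 1, 0, 1, 0, 1, 0),
  E = c(1, 1, 1, 1, 1, 1, 1, 0)
)
n_keep <- 3  # hypothetical; the question uses 400 out of 1000 columns

# Rank the binary columns by their number of 1s; +1 skips past 'Earning'.
new_order <- order(colSums(dfs[, -1], na.rm = TRUE), decreasing = TRUE) + 1
keep <- new_order[seq_len(n_keep)]   # most popular columns
rest <- new_order[-seq_len(n_keep)]  # columns pooled into 'other'

res <- cbind(
  dfs[, c(1, keep)],
  # Row-wise OR: 1 if any pooled column has a 1 in that row.
  other = as.integer(rowSums(dfs[, rest, drop = FALSE], na.rm = TRUE) > 0)
)

names(res)  # "Earning" "E" "D" "A" "other"
res$other   # 0 0 0 0 0 1 1 0

# Regress earnings on the retained indicators plus the pooled column.
fit <- lm(Earning ~ ., data = res)
```

Here the column sums are A = 3, B = 2, C = 0, D = 5, E = 7, so E, D and A are kept, and "other" is the OR of B and C.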
