Question

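I need some help filtering a data.table in R. I have a file with millions of rows, each containing 4 words.

I want to remove the rows I don't need. Each row has 4 words and a frequency.

For each combination of the first 3 words, I want to keep only the 3 most frequent rows.

Below is an example of the data.table, and what I need as output.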

library(data.table)

text <- c("Run to the hills", "Run to the mountains", "Run to the highway", "Run to the top", "Run to the horizon",
          "Go away with him", "Go away with her",
          "I am a good", "I am a bad", "I am a uggly", "I am a guy", "I am a woman",
          "I am the most")

frequency <- c(0.1, 0.09, 0.2, 0.05, 0.001,
               0.05, 0.04,
               0.1, 0.06, 0.3, 0.05, 0.1,
               0.2)

DT <- data.table(text = text, frequency = frequency)

#Original output:
                    text frequency
 1:     Run to the hills     0.100
 2: Run to the mountains     0.090
 3:   Run to the highway     0.200
 4:       Run to the top     0.050
 5:   Run to the horizon     0.001
 6:     Go away with him     0.050
 7:     Go away with her     0.040
 8:          I am a good     0.100
 9:           I am a bad     0.060
10:         I am a uggly     0.300
11:           I am a guy     0.050
12:         I am a woman     0.100
13:        I am the most     0.200

Desired output (only the top 3 frequencies among rows that share the same first 3 words):

                   text frequency
1:     Go away with him      0.05
2:     Go away with her      0.04
3:         I am a uggly      0.30
4:         I am a woman      0.10
5:          I am a good      0.10
6:        I am the most      0.20
7:   Run to the highway      0.20
8:     Run to the hills      0.10
9: Run to the mountains      0.09

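So, I want to keep the top 3 rows, ordered by the frequency column, for each of: "Run to the XXXXX", "Go away with XXXXX", "I am a XXXXX", "I am the XXXXX".

In this case, I would drop: "Run to the top", "Run to the horizon", "I am a bad", "I am a guy".

I was thinking of using a regular expression, but I am a bit lost right now :-\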

Answer 1

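You can use sub() to create an id column made up of the first three words, then use it to take the top three rows by frequency within each id.

It is easier to do than to describe: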

library(data.table)

## add an id column containing only the first three words
DT[, id := sub(" \\S+$", "", text)]
## order by frequency, take the top three by id, remove id and NAs
## and with a little help from Frank :)
na.omit(
  DT[order(frequency, decreasing = TRUE), .SD[1:3], keyby = id][, id := NULL][]
)
#                    text frequency
# 1:     Go away with him      0.05
# 2:     Go away with her      0.04
# 3:         I am a uggly      0.30
# 4:          I am a good      0.10
# 5:         I am a woman      0.10
# 6:        I am the most      0.20
# 7:   Run to the highway      0.20
# 8:     Run to the hills      0.10
# 9: Run to the mountains      0.09
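
To see what the sub() call above does on its own, here is a minimal base-R sketch. The three sample strings are taken from the question's data and the variable names are only illustrative; the pattern assumes words are separated by single spaces and that there is no trailing whitespace.

```r
# Sample strings copied from the question's example data (illustrative subset)
text <- c("Run to the hills", "Go away with him", "I am the most")

# " \\S+$" matches one space followed by the final run of non-space
# characters, so sub() removes the last word and keeps the first three
id <- sub(" \\S+$", "", text)
print(id)
```

Rows that share the same first three words end up with the same id, which is what makes the grouped "top three" step possible.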

Answer 2

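# group = the first three words of each row
DT[, group := sub(" \\S+$", "", text)]
# rank within each group, highest frequency first (rank 1 = most frequent)
DT[, grank := base::rank(-frequency), by = group]
# keep only the rows ranked in the top three of their group
DT[grank <= 3]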

The rank function is used so that the OP can specify how ties are handled (through its ties.method argument).
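
As a rough illustration of why the tie handling matters, the base-R sketch below uses a made-up group of four frequencies (not from the question's data) in which three rows tie for second place. With the default ties.method = "average" all four rows pass a `<= 3` filter, while ties.method = "first" keeps exactly three.

```r
# Hypothetical frequencies for one group: three rows tie for second place
frequency <- c(0.3, 0.1, 0.1, 0.1)

# Default ties.method = "average": the tied rows share the mean rank (3)
avg_rank <- rank(-frequency)

# ties.method = "first": ties are broken by position, giving ranks 1, 2, 3, 4
first_rank <- rank(-frequency, ties.method = "first")

print(sum(avg_rank <= 3))    # rows kept with the default tie handling
print(sum(first_rank <= 3))  # rows kept when ties are broken by position
```

Which behaviour is right depends on whether a group may return more than three rows when frequencies are equal.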

Exclude rows from a data.table based on sort order
