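I need some help filtering a data.table in R. I have a file with millions of rows; each row has 4 words and a frequency.
I want to drop the rows I don't need.
For each combination of the first 3 words, I want to keep only the 3 rows with the highest frequency.
Below is an example of the data.table, along with the output I need.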
library(data.table)
text <- c("Run to the hills", "Run to the mountains", "Run to the highway", "Run to the top", "Run to the horizon",
          "Go away with him", "Go away with her",
          "I am a good", "I am a bad", "I am a uggly", "I am a guy", "I am a woman",
          "I am the most")
frequency <- c(0.1, 0.09, 0.2, 0.05, 0.001,
               0.05, 0.04,
               0.1, 0.06, 0.3, 0.05, 0.1,
               0.2)
DT <- data.table(text = text, frequency = frequency)
#Original output:
text frequency
1: Run to the hills 0.100
2: Run to the mountains 0.090
3: Run to the highway 0.200
4: Run to the top 0.050
5: Run to the horizon 0.001
6: Go away with him 0.050
7: Go away with her 0.040
8: I am a good 0.100
9: I am a bad 0.060
10: I am a uggly 0.300
11: I am a guy 0.050
12: I am a woman 0.100
13: I am the most 0.200
Needed output (only the top 3 frequencies within each group sharing the same first 3 words):
text frequency
1: Go away with him 0.05
2: Go away with her 0.04
3: I am a uggly 0.30
4: I am a woman 0.10
5: I am a good 0.10
6: I am the most 0.20
7: Run to the highway 0.20
8: Run to the hills 0.10
9: Run to the mountains 0.09
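So I want to keep the top three rows, ordered by the frequency column, for each of: "Run to the XXXXX", "Go away with XXXXX", "I am a XXXXX", "I am the XXXXX".
In this case, I would drop: "Run to the top", "Run to the horizon", "I am a bad", "I am a guy".
I was thinking about using regular expressions, but I'm a bit lost right now :-\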
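Answer 0 (score: 4)
You can use sub() to create an id column made up of the first three words, then use that column to pick the top three frequency values in each group.
It's easier done than said...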
library(data.table)
## add an id column containing only the first three words
DT[, id := sub(" \\S+$", "", text)]
## order by frequency, take the top three by id, remove id and NAs
## and with a little help from Frank :)
na.omit(
DT[order(frequency, decreasing = TRUE), .SD[1:3], keyby = id][, id := NULL][]
)
# text frequency
# 1: Go away with him 0.05
# 2: Go away with her 0.04
# 3: I am a uggly 0.30
# 4: I am a good 0.10
# 5: I am a woman 0.10
# 6: I am the most 0.20
# 7: Run to the highway 0.20
# 8: Run to the hills 0.10
# 9: Run to the mountains 0.09
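To make the grouping step easier to follow, here is a minimal base R sketch of what the sub() call does on its own. The three sample strings are taken from the question; the sketch assumes words are separated by single spaces and that there is no trailing whitespace, which may not hold for real data.

```r
# Base R only: derive the grouping key by dropping the last word.
text <- c("Run to the hills", "Go away with him", "I am the most")

# " \\S+$" matches one space followed by the final run of non-space
# characters, so replacing that match with "" leaves the leading words.
id <- sub(" \\S+$", "", text)

print(id)
# [1] "Run to the"   "Go away with" "I am the"
```

If the input might carry trailing spaces, the pattern would not match and the string would come back unchanged, so trimming the text first (for example with trimws()) is probably worth considering.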
Answer 1 (score: 1)
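## group key: the first three words (drop the last word)
DT[, group := sub(" \\S+$", "", text)]
## rank within each group; negating frequency makes rank 1 the highest frequency
DT[, grank := base::rank(-frequency), by = group]
## keep the rows ranked in the top three of their group
DT[grank <= 3]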
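The rank function is used so that the OP can specify how ties are handled, via its ties.method argument (the default is "average").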
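As a rough illustration of why the tie-handling choice matters, the sketch below applies three of base R's ties.method options to a hypothetical frequency vector (not from the question) in which two values tie across the top-3 cutoff. The variable names are made up for this example.

```r
# Hypothetical frequencies for one group: the 3rd and 4th values tie at 0.10.
freq <- c(0.30, 0.20, 0.10, 0.10, 0.05)

# Negate so that the highest frequency gets rank 1.
r_avg   <- rank(-freq)                         # default "average": 1.0 2.0 3.5 3.5 5.0
r_min   <- rank(-freq, ties.method = "min")    # 1 2 3 3 5
r_first <- rank(-freq, ties.method = "first")  # 1 2 3 4 5

# Number of rows that survive a `rank <= 3` filter under each method.
kept <- c(average = sum(r_avg <= 3),
          min     = sum(r_min <= 3),
          first   = sum(r_first <= 3))
print(kept)
# average     min   first
#       2       4       3
```

With the default, both tied rows fall just past the cutoff and are dropped; "min" keeps both of them, and "first" keeps exactly three rows by breaking the tie in order of appearance. Which of these is right depends on what the OP wants when frequencies tie.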