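I am trying to create a variable that identifies whether a string in a vector is appearing for the first time, is within its first three occurrences, or has appeared more than three times. For example:
In the dataset below I have name (there will be more names), text, and a dup variable. I want dup to indicate whether the text is occurring for the first time (origin), is its second or third occurrence (FirstThree), or has occurred more than three times (MoreThanThree). I also need to do this for each person... but I think I can figure that part out. Thanks in advance for your help!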
name =c("T","T","T","T","T","T","T","T","T","T")
text =c("a","b","a","a","b","c","a","a","b","a")
dup =c("origin","origin","FirstThree","FirstThree","FirstThree","origin","MoreThanThree","MoreThanThree","FirstThree","MoreThanThree")
dfA = data.frame(name,text,dup)
   name text           dup
1     T    a        origin
2     T    b        origin
3     T    a    FirstThree
4     T    a    FirstThree
5     T    b    FirstThree
6     T    c        origin
7     T    a MoreThanThree
8     T    a MoreThanThree
9     T    b    FirstThree
10    T    a MoreThanThree
Answer 0 (score: 2)
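You can use data.table::rowid with two ifelse checks. Since := only works on a data.table, dfA has to be converted first, for example with setDT:
library(data.table)
setDT(dfA)  # convert the data.frame to a data.table by reference
dfA[, ict := {
  r <- rowid(text)  # running occurrence count of each text value within the current name
  ifelse(r == 1, 'origin',
         ifelse(r <= 3, 'FirstThree',
                'MoreThanThree'))}
  , by = name]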
dfA
# name text dup ict
# 1: T a origin origin
# 2: T b origin origin
# 3: T a FirstThree FirstThree
# 4: T a FirstThree FirstThree
# 5: T b FirstThree FirstThree
# 6: T c origin origin
# 7: T a MoreThanThree MoreThanThree
# 8: T a MoreThanThree MoreThanThree
# 9: T b FirstThree FirstThree
# 10: T a MoreThanThree MoreThanThree
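To make the running count that rowid supplies more concrete, here is a minimal base R sketch of the same idea using ave. It is only an illustration, not part of the answer above: the data frame dfB and the second person "U" are made up for the example.

```r
# Sample data from the question, plus a hypothetical second person "U"
name <- c(rep("T", 10), "U", "U")
text <- c("a", "b", "a", "a", "b", "c", "a", "a", "b", "a", "a", "a")
dfB <- data.frame(name, text, stringsAsFactors = FALSE)

# ave() applies seq_along() within each name/text group,
# so occ is 1 for the first occurrence, 2 for the second, and so on
occ <- ave(seq_along(dfB$text), dfB$name, dfB$text, FUN = seq_along)

# Same two nested ifelse checks as in the data.table version
dfB$dup <- ifelse(occ == 1, "origin",
           ifelse(occ <= 3, "FirstThree", "MoreThanThree"))

print(dfB)
```

Because the count restarts for every name/text pair, the first "a" for person "U" is labelled origin even though "T" has already used "a" six times.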
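You can also use cut. The only difference is that this produces a factor rather than a character vector. It may be useful if you have more than 3 categories.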
dfA[, ict := cut(rowid(text), c(0, 1, 3, Inf),
labels = c('origin', 'FirstThree', 'MoreThanThree'))
, by = name]
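The way cut maps occurrence counts to labels can be checked on a plain vector. The sketch below uses base R only. The vector occ stands in for the output of rowid, and the fourth label "MoreThanFive" (with its companion "FourOrFive") is a hypothetical extension added purely to show how further categories would be introduced.

```r
# Occurrence counts 1..6, standing in for rowid(text)
occ <- 1:6

# Intervals are right-closed by default: (0,1], (1,3], (3,Inf]
lab <- cut(occ, breaks = c(0, 1, 3, Inf),
           labels = c("origin", "FirstThree", "MoreThanThree"))

is.factor(lab)                 # cut returns a factor
lab_chr <- as.character(lab)   # convert if a character column is preferred

# Hypothetical four-category variant: one extra break, one extra label
lab4 <- cut(occ, breaks = c(0, 1, 3, 5, Inf),
            labels = c("origin", "FirstThree", "FourOrFive", "MoreThanFive"))
print(data.frame(occ, lab, lab4))
```

Adding a category only requires one more break point and one more label, whereas the ifelse version needs another nesting level.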
Answer 1 (score: 0)
In dplyr, we can compare row_number() inside a case_when statement.
library(dplyr)
dfA %>%
  group_by(text) %>%
  mutate(row = row_number(),  # running occurrence count within each text value
         dup = case_when(row == 1 ~ "origin",
                         row <= 3 ~ "FirstThree",
                         TRUE ~ "MoreThanThree"))
# name text row dup
# <fct> <fct> <int> <chr>
# 1 T a 1 origin
# 2 T b 1 origin
# 3 T a 2 FirstThree
# 4 T a 3 FirstThree
# 5 T b 2 FirstThree
# 6 T c 1 origin
# 7 T a 4 MoreThanThree
# 8 T a 5 MoreThanThree
# 9 T b 3 FirstThree
#10 T a 6 MoreThanThree
We can remove the row column afterwards if it is not needed.
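Since the question also asks for the count to be done per person, one possible variation is to group by both name and text and drop the helper column in the same pipeline. This is a sketch under the assumption that dplyr is installed. The result name res is made up for the example, and the data is rebuilt here without the dup column so the block runs on its own.

```r
library(dplyr)

# Sample data from the question (dup is recomputed below)
name <- rep("T", 10)
text <- c("a", "b", "a", "a", "b", "c", "a", "a", "b", "a")
dfA <- data.frame(name, text, stringsAsFactors = FALSE)

res <- dfA %>%
  group_by(name, text) %>%                 # count occurrences per person and text
  mutate(row = row_number(),
         dup = case_when(row == 1 ~ "origin",
                         row <= 3 ~ "FirstThree",
                         TRUE ~ "MoreThanThree")) %>%
  ungroup() %>%                            # drop the grouping before further work
  select(-row)                             # remove the helper column

print(res)
```

With a single name the labels are the same as when grouping by text alone. The extra grouping variable only matters once several people share the same text values.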