R: Counting unique values within groups

Time: 2019-12-05 04:25:20

Tags: r data.table

Update:

I have the following data frame:

df <- data.frame(clause = c("Hello world my dearest","Hello world my dearest","Hello world my dearest","Hello world my dearest","Hello world my dearest","Hello world my dearest"),
                 word = c("Hello", "Hello", "world", "my", "dearest", "dearest"),
                 syllable = c("He", "lo", "world", "my", "dea", "rest"),
                 phrase_ID = c("1", "1", "1", "2", "2", "2"))


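This basically shows the syntax of the clause "Hello world my dearest", which consists of 2 phrases, 4 words and 6 syllables. The phrases are identified only by a phrase ID.

I applied the following transformation to get the position and total number of syllables within words and phrases.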

library(data.table)

setDT(df)[, word_ID := rleid(word, phrase_ID)]
df[, poss_syll_in_word := seq_len(.N), by = word_ID]


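What I want to do now is apply the same kind of transformation to get the position and number of words within each phrase, and of words within the clause: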

df$poss_word_in_phrase <- c("1", "1", "2", "1", "2", "2")
df$n_word_in_phrase <- c("2", "2", "2", "2", "2", "2")

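I can't find a way to do this. Any ideas?

1 Answer:

Answer 0: (score: 0)

Edit

Based on the updated data, I think we can get the expected output with data.table::rleid combined with a distinct count: uniqueN in data.table, or n_distinct in dplyr.

With data.table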

df[, c("poss_word_in_phrase", "n_word_in_phrase") := 
            list(rleid(word_ID), uniqueN(word_ID)), by = phrase_ID]
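
For reference, here is a self-contained sketch of the data.table approach from start to finish. It assumes the data frame from the question's update (with phrase_ID and no phrase column); the name dt is only illustrative. Note that the resulting columns are integers, not the character strings used in the question's mock-up.

```r
library(data.table)

# Data from the question's update
dt <- data.table(
  clause    = rep("Hello world my dearest", 6),
  word      = c("Hello", "Hello", "world", "my", "dearest", "dearest"),
  syllable  = c("He", "lo", "world", "my", "dea", "rest"),
  phrase_ID = c("1", "1", "1", "2", "2", "2")
)

# One ID per run of identical (word, phrase_ID) pairs: 1 1 2 3 4 4
dt[, word_ID := rleid(word, phrase_ID)]

# Within each phrase: running word position and number of distinct words
dt[, c("poss_word_in_phrase", "n_word_in_phrase") :=
     list(rleid(word_ID), uniqueN(word_ID)), by = phrase_ID]

dt$poss_word_in_phrase  # 1 1 2 1 2 2
dt$n_word_in_phrase     # 2 2 2 2 2 2
```

Because rleid numbers consecutive runs rather than distinct values, this relies on the rows being in reading order, as they are in the question.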

Or with dplyr

library(dplyr)

df %>%
  group_by(phrase_ID) %>%
  mutate(poss_word_in_phrase = data.table::rleid(word_ID), 
         n_word_in_phrase = n_distinct(word_ID))

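Using tidyverse, we can split each string into its words to find the position of word within it, and use str_count to count the number of words. This version works on the original data, which has a phrase column (see Data below).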

library(tidyverse)

df %>%
  mutate(poss_word_in_phrase = map2_dbl(str_split(phrase, "\\s+"), word, 
                               ~match(.y, .x)), 
         n_word_in_phrase = str_count(phrase, "\\w+"), 
         poss_word_in_clause = map2_dbl(str_split(clause, "\\s+"), word, 
                               ~match(.y, .x)),
         n_word_in_clause =  str_count(clause, "\\w+"))

#                     clause      phrase    word syllable poss_word_in_phrase
#1 Hello world my dearest Hello world   Hello       He                   1
#2 Hello world my dearest Hello world   Hello       lo                   1
#3 Hello world my dearest Hello world   world    world                   2
#4 Hello world my dearest  my dearest      my       my                   1
#5 Hello world my dearest  my dearest dearest      dea                   2
#6 Hello world my dearest  my dearest dearest     rest                   2

#  n_word_in_phrase poss_word_in_clause n_word_in_clause
#1                2                   1                4
#2                2                   1                4
#3                2                   2                4
#4                2                   3                4
#5                2                   4                4
#6                2                   4                4
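
One caveat with the match-based approach, as far as I can tell: match() returns the index of the first occurrence only, so it assumes no word is repeated within a phrase or clause. The sketch below uses a made-up phrase (not from the question's data) and base R only to show the difference.

```r
# Hypothetical phrase with a repeated word (not from the question's data)
phrase <- "the cat and the dog"
words  <- strsplit(phrase, "\\s+")[[1]]

# match() reports the first occurrence, so the second "the" maps to 1, not 4
match(words, words)
# 1 2 3 1 5

# The actual running positions
seq_along(words)
# 1 2 3 4 5
```

With repeated words, the rleid-based version from the edit is the safer choice, since it depends on row order and not on the word text.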

Data

I slightly corrected your input data, assuming that you meant "world" rather than "word" in the clause column.

df <- data.frame(clause = c("Hello world my dearest","Hello world my dearest",
    "Hello world my dearest","Hello world my dearest",
    "Hello world my dearest","Hello world my dearest"),
    phrase = c("Hello world", "Hello world", "Hello world", "my dearest", 
     "my dearest", "my dearest"),
    word =  c("Hello", "Hello", "world", "my", "dearest", "dearest"),
    syllable = c("He", "lo", "world", "my", "dea", "rest"))