Update:
I have the following data frame:
df <- data.frame(clause = c("Hello world my dearest","Hello world my dearest","Hello world my dearest","Hello world my dearest","Hello world my dearest","Hello world my dearest"),
word = c("Hello", "Hello", "world", "my", "dearest", "dearest"),
syllable = c("He", "lo", "world", "my", "dea", "rest"),
phrase_ID = c("1", "1", "1", "2", "2", "2"))
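This basically shows the structure of the clause "Hello world my dearest", which consists of 2 phrases, 4 words and 6 syllables. The phrases are identified only by a phrase ID.
I applied the following transformation to get the position and total number of syllables within words and phrases.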
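library(data.table)  # provides setDT(), rleid() and :=
setDT(df)[, word_ID := rleid(word, phrase_ID)]         # one ID per run of the same word within a phrase
df[, poss_syll_in_word := sequence(.N), by = word_ID]  # position of each syllable within its word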
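The code above only adds the syllable position. As a minimal sketch of how the total per word could be added in the same step, `.N` (the number of rows in the current group) can be assigned alongside it. The column name `n_syll_in_word` is my own choice for illustration and does not come from the original data.

```r
library(data.table)

# Same toy data as in the question (only the columns needed here)
df <- data.table(
  word      = c("Hello", "Hello", "world", "my", "dearest", "dearest"),
  syllable  = c("He", "lo", "world", "my", "dea", "rest"),
  phrase_ID = c("1", "1", "1", "2", "2", "2")
)

# Consecutive rows with the same word and phrase share one word_ID
df[, word_ID := rleid(word, phrase_ID)]

# Syllable position within its word, plus the syllable count of that word
# (n_syll_in_word is a hypothetical column name)
df[, `:=`(poss_syll_in_word = seq_len(.N),
          n_syll_in_word    = .N), by = word_ID]

print(df)
```

Both new columns are integers, and the count is recycled across all rows of the same word.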
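What I want to do now is apply the same transformation to get the position and number of words within the phrase and of words within the clause: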
df$poss_word_in_phrase <- c("1", "1", "2", "1", "2", "2")
df$n_word_in_phrase <- c("2", "2", "2", "2", "2", "2")
I can't find a way to do this. Any ideas?
Answer 0 (score: 0)
Edit
Based on the updated data, I think we can use data.table::rleid together with uniqueN (or n_distinct in dplyr) to get the expected output.
With data.table:
df[, c("poss_word_in_phrase", "n_word_in_phrase") :=
     list(rleid(word_ID), uniqueN(word_ID)), by = phrase_ID]
Or with dplyr:
library(dplyr)
df %>%
  group_by(phrase_ID) %>%
  mutate(poss_word_in_phrase = data.table::rleid(word_ID),
         n_word_in_phrase = n_distinct(word_ID))
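The question also asks for the position and number of words within the clause. A possible sketch, assuming the same idea carries over by grouping on the clause instead of the phrase, is shown here with data.table. Grouping by the `clause` text is an assumption on my part: if the same clause text could occur more than once, a separate clause identifier (not present in the original data) would be needed.

```r
library(data.table)

# Toy data from the question
df <- data.table(
  clause    = rep("Hello world my dearest", 6),
  word      = c("Hello", "Hello", "world", "my", "dearest", "dearest"),
  syllable  = c("He", "lo", "world", "my", "dea", "rest"),
  phrase_ID = c("1", "1", "1", "2", "2", "2")
)
df[, word_ID := rleid(word, phrase_ID)]

# Word position and word count within each phrase
df[, c("poss_word_in_phrase", "n_word_in_phrase") :=
     list(rleid(word_ID), uniqueN(word_ID)), by = phrase_ID]

# Same idea one level up: word position and word count within each clause
# (assumes each distinct clause text is one clause)
df[, c("poss_word_in_clause", "n_word_in_clause") :=
     list(rleid(word_ID), uniqueN(word_ID)), by = clause]

print(df)
```

Restarting `rleid` inside each group is what turns the running `word_ID` into a position, while `uniqueN` counts the distinct words in that group.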
Using tidyverse, we can split the phrase (or clause) string into words to find the position of word within it, and use str_count to count the number of words.
library(tidyverse)
df %>%
mutate(poss_word_in_phrase = map2_dbl(str_split(phrase, "\\s+"), word,
~match(.y, .x)),
n_word_in_phrase = str_count(phrase, "\\w+"),
poss_word_in_clause = map2_dbl(str_split(clause, "\\s+"), word,
~match(.y, .x)),
n_word_in_clause = str_count(clause, "\\w+"))
# clause phrase word syllable poss_word_in_phrase
#1 Hello world my dearest Hello world Hello He 1
#2 Hello world my dearest Hello world Hello lo 1
#3 Hello world my dearest Hello world world world 2
#4 Hello world my dearest my dearest my my 1
#5 Hello world my dearest my dearest dearest dea 2
#6 Hello world my dearest my dearest dearest rest 2
# n_word_in_phrase poss_word_in_clause n_word_in_clause
#1 2 1 4
#2 2 1 4
#3 2 2 4
#4 2 3 4
#5 2 4 4
#6 2 4 4
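One caveat with the `match` approach: `match` returns only the position of the first matching element, so if a word occurs more than once in the same phrase or clause, every occurrence gets the position of the first one. A small base R illustration follows, using a made-up phrase that is not part of the original data.

```r
# Hypothetical phrase containing a repeated word
words_in_phrase <- strsplit("my dear dear friend", "\\s+")[[1]]

# match() reports only the first position at which "dear" occurs
first_pos <- match("dear", words_in_phrase)

# which() reports every position at which "dear" occurs
all_pos <- which(words_in_phrase == "dear")

print(first_pos)
print(all_pos)
```

For data with repeated words, the run-length approach based on `word_ID` avoids this issue because it relies on row order rather than on the word text.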
Data
I slightly corrected your input data, assuming that you meant "world" rather than "word" in the clause column.
df <- data.frame(clause = c("Hello world my dearest","Hello world my dearest",
"Hello world my dearest","Hello world my dearest",
"Hello world my dearest","Hello world my dearest"),
phrase = c("Hello world", "Hello world", "Hello world", "my dearest",
"my dearest", "my dearest"),
word = c("Hello", "Hello", "world", "my", "dearest", "dearest"),
syllable = c("He", "lo", "world", "my", "dea", "rest"))