Question

我有一个数据帧，每行有一个单词，但我想根据列spk和id的值，使用R将其转换为每行的句子/段落。这是一个数据框样本：

id  word    spk
123 hi       1
123 how      1
123 are      1
123 you      1
123 good     2
123 thank    2
123 you      2
123 Anything 1
123 I        1
123 can      1
123 do       1
123 nothing  2
123 for      2
123 now      2
789 what     1
789 is       1
789 the      1
789 issue    1
789 there    2
789 is       2
789 no       2
789 issue    2
789 now      2
789 thank    1
789 you      1
789 for      1
789 contacting1
789 us       1

所需的输出

id  word                           spk
123 hi how are you                  1
123 good thank you                  2
123 Anything I can do               1
123 nothing for now                 2
789 what is the issue               1
789 there is no issue now           2
789 thank you for contacting us     1

任何帮助将不胜感激。谢谢

Answer 1

require(data.table)
setDT(df)

df[, .(word = paste(word, collapse = ' ')
     , id   = unique(id)
     , spk  = unique(spk))
   , by = .(phrase = rleid(spk))]

#    phrase                        word  id spk
# 1:      1              hi how are you 123   1
# 2:      2              good thank you 123   2
# 3:      3           Anything I can do 123   1
# 4:      4             nothing for now 123   2
# 5:      5           what is the issue 789   1
# 6:      6       there is no issue now 789   2
# 7:      7 thank you for contacting us 789   1

在tidyverse这是

df %>%
  mutate(phrase = data.table::rleid(spk)) %>%
  group_by(phrase) %>%
  summarise(id = unique(id),
            words = paste(word, collapse = " "))

Answer 2

这是一种tidyverse方法，也借用了rleid包中的data.table：

library(tidyverse)

df <- 
structure(list(id = c(123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L), 
               word = structure(c(8L, 9L, 2L, 21L, 7L, 16L, 21L, 1L, 10L, 3L, 5L, 14L, 6L, 15L, 20L, 11L, 17L, 12L, 18L, 11L, 13L, 12L, 15L, 16L, 21L, 6L, 4L, 19L), 
              .Label = c("Anything", "are", "can", "contacting", "do", "for", "good", "hi", "how", "I", "is", "issue", "no", "nothing", "now", "thank", "the", "there", "us", "what", "you"), class = "factor"), 
              spk = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA, -28L))

df %>%
  mutate(spk_num = data.table::rleid(spk)) %>%
  mutate(word = as.character(word)) %>%
  group_by(id, spk, spk_num) %>%
  nest() %>%
  mutate(words = map(data, unlist),
         words = map_chr(words, paste, collapse = " "))

# A tibble: 7 x 5
     id   spk spk_num data             words                      
  <int> <int>   <int> <list>           <chr>                      
1   123     1       1 <tibble [4 x 1]> hi how are you             
2   123     2       2 <tibble [3 x 1]> good thank you             
3   123     1       3 <tibble [4 x 1]> Anything I can do          
4   123     2       4 <tibble [3 x 1]> nothing for now            
5   789     1       5 <tibble [4 x 1]> what is the issue          
6   789     2       6 <tibble [5 x 1]> there is no issue now      
7   789     1       7 <tibble [5 x 1]> thank you for contacting us

概述：首先，制作一个spk_num列，因为您想要的输出希望区分同一ID中的扬声器。接下来，group_by所有内容和nest数据。最后，我们会在整个输出中将paste映射为崩溃，但您首先需要取消单词数据。

Answer 3

您可以使用data.table和函数paste（）通过适当地对行进行分组来执行此操作。唯一的问题是你没有参考你所在的对话的哪个短语，所以首先你必须创建一个新的列来指定这个。

library(data.table)

#This function takes a vector of speakers, say, (1,1,1,2,2,1,1,2,2,2) and gives you which phrase of the conversation you're on (1,1,1,2,2,3,3,4,4,4)

class_phrase = function(spk){
  phrase = 1
  phrase_n = 1
  for(i in 2:length(spk)){
    if(spk[i] == spk[i-1])
      {phrase[i] = phrase[i-1]} else
      {phrase_n = phrase_n + 1
       phrase[i] = phrase_n}
  }
  return(phrase)
}

#Now you use this function with your data.table, grouping by id:

dt[, phrase := class_phrase(spk), by = id]

#Having a phrase number, you can just paste everything together by grouping your original data.table by id and phrase

dt[, .(word = paste(word, collapse = ' '), spk = unique(spk)), by = .(id, phrase)]

Answer 4

另一种dplyr方法，假设您的word列是字符向量：

library(dplyr)

df %>%
  mutate(group = cumsum(ifelse(df$spk != lag(df$spk, default = 0), 1, 0))) %>%
  group_by(id, group) %>%
  mutate(sentence = paste(word, collapse = " ")) %>%
  ungroup %>%
  select(-word, -group) %>%
  distinct() 

# A tibble: 7 x 3
     id   spk sentence                   
  <int> <int> <chr>                      
1   123     1 hi how are you             
2   123     2 good thank you             
3   123     1 Anything I can do          
4   123     2 nothing for now            
5   789     1 what is the issue          
6   789     2 there is no issue now      
7   789     1 thank you for contacting us

Answer 5

这是一个基本版本：

df$i <-  cumsum(c(FALSE,!!diff(df$spk)))
aggregate(word ~ id + spk + i,df,paste,collapse= " ")[-3]
#    id spk                        word
# 1 123   1              hi how are you
# 2 123   2              good thank you
# 3 123   1           Anything I can do
# 4 123   2             nothing for now
# 5 789   1           what is the issue
# 6 789   2       there is no issue now
# 7 789   1 thank you for contacting us

使用R将每个单独行中的单词连接成一行

5 个答案: