我有一个数据帧,每行有一个单词,但我想根据列spk和id的值,使用R将其转换为每行的句子/段落。这是一个数据框样本:
id word spk
123 hi 1
123 how 1
123 are 1
123 you 1
123 good 2
123 thank 2
123 you 2
123 Anything 1
123 I 1
123 can 1
123 do 1
123 nothing 2
123 for 2
123 now 2
789 what 1
789 is 1
789 the 1
789 issue 1
789 there 2
789 is 2
789 no 2
789 issue 2
789 now 2
789 thank 1
789 you 1
789 for 1
789 contacting1
789 us 1
所需的输出
id word spk
123 hi how are you 1
123 good thank you 2
123 Anything I can do 1
123 nothing for now 2
789 what is the issue 1
789 there is no issue now 2
789 thank you for contacting us 1
任何帮助将不胜感激。谢谢
答案 0 :(得分:2)
require(data.table)
setDT(df)
df[, .(word = paste(word, collapse = ' ')
, id = unique(id)
, spk = unique(spk))
, by = .(phrase = rleid(spk))]
# phrase word id spk
# 1: 1 hi how are you 123 1
# 2: 2 good thank you 123 2
# 3: 3 Anything I can do 123 1
# 4: 4 nothing for now 123 2
# 5: 5 what is the issue 789 1
# 6: 6 there is no issue now 789 2
# 7: 7 thank you for contacting us 789 1
在tidyverse这是
df %>%
mutate(phrase = data.table::rleid(spk)) %>%
group_by(phrase) %>%
summarise(id = unique(id),
words = paste(word, collapse = " "))
答案 1 :(得分:1)
这是一种tidyverse
方法,也借用了rleid
包中的data.table
:
library(tidyverse)
df <-
structure(list(id = c(123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L, 789L),
word = structure(c(8L, 9L, 2L, 21L, 7L, 16L, 21L, 1L, 10L, 3L, 5L, 14L, 6L, 15L, 20L, 11L, 17L, 12L, 18L, 11L, 13L, 12L, 15L, 16L, 21L, 6L, 4L, 19L),
.Label = c("Anything", "are", "can", "contacting", "do", "for", "good", "hi", "how", "I", "is", "issue", "no", "nothing", "now", "thank", "the", "there", "us", "what", "you"), class = "factor"),
spk = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA, -28L))
df %>%
mutate(spk_num = data.table::rleid(spk)) %>%
mutate(word = as.character(word)) %>%
group_by(id, spk, spk_num) %>%
nest() %>%
mutate(words = map(data, unlist),
words = map_chr(words, paste, collapse = " "))
# A tibble: 7 x 5
id spk spk_num data words
<int> <int> <int> <list> <chr>
1 123 1 1 <tibble [4 x 1]> hi how are you
2 123 2 2 <tibble [3 x 1]> good thank you
3 123 1 3 <tibble [4 x 1]> Anything I can do
4 123 2 4 <tibble [3 x 1]> nothing for now
5 789 1 5 <tibble [4 x 1]> what is the issue
6 789 2 6 <tibble [5 x 1]> there is no issue now
7 789 1 7 <tibble [5 x 1]> thank you for contacting us
概述:首先,制作一个spk_num
列,因为您想要的输出希望区分同一ID中的扬声器。接下来,group_by
所有内容和nest
数据。最后,我们会在整个输出中将paste
映射为崩溃,但您首先需要取消单词数据。
答案 2 :(得分:0)
您可以使用data.table和函数paste()通过适当地对行进行分组来执行此操作。唯一的问题是你没有参考你所在的对话的哪个短语,所以首先你必须创建一个新的列来指定这个。
library(data.table)
#This function takes a vector of speakers, say, (1,1,1,2,2,1,1,2,2,2) and gives you which phrase of the conversation you're on (1,1,1,2,2,3,3,4,4,4)
class_phrase = function(spk){
phrase = 1
phrase_n = 1
for(i in 2:length(spk)){
if(spk[i] == spk[i-1])
{phrase[i] = phrase[i-1]} else
{phrase_n = phrase_n + 1
phrase[i] = phrase_n}
}
return(phrase)
}
#Now you use this function with your data.table, grouping by id:
dt[, phrase := class_phrase(spk), by = id]
#Having a phrase number, you can just paste everything together by grouping your original data.table by id and phrase
dt[, .(word = paste(word, collapse = ' '), spk = unique(spk)), by = .(id, phrase)]
答案 3 :(得分:0)
另一种dplyr
方法,假设您的word
列是字符向量:
library(dplyr)
df %>%
mutate(group = cumsum(ifelse(df$spk != lag(df$spk, default = 0), 1, 0))) %>%
group_by(id, group) %>%
mutate(sentence = paste(word, collapse = " ")) %>%
ungroup %>%
select(-word, -group) %>%
distinct()
# A tibble: 7 x 3
id spk sentence
<int> <int> <chr>
1 123 1 hi how are you
2 123 2 good thank you
3 123 1 Anything I can do
4 123 2 nothing for now
5 789 1 what is the issue
6 789 2 there is no issue now
7 789 1 thank you for contacting us
答案 4 :(得分:0)
这是一个基本版本:
df$i <- cumsum(c(FALSE,!!diff(df$spk)))
aggregate(word ~ id + spk + i,df,paste,collapse= " ")[-3]
# id spk word
# 1 123 1 hi how are you
# 2 123 2 good thank you
# 3 123 1 Anything I can do
# 4 123 2 nothing for now
# 5 789 1 what is the issue
# 6 789 2 there is no issue now
# 7 789 1 thank you for contacting us