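I have a dataset of chat conversations between two parties. I want to collapse it into a turn-by-turn dialogue between Person1 and Person2.
Sometimes a person types several sentences in a row, and these show up as multiple records in the data frame.
Here is a sketch of what I am trying to work out.
This is what the data frame looks like now: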
id timestamp line_by line_text
1234 02:54.3 Person1 Text Line 1
1234 03:23.8 Person2 Text Line 2
1234 03:47.0 Person2 Text Line 3
1234 04:46.8 Person1 Text Line 4
1234 05:46.2 Person1 Text Line 5
9876 06:44.5 Person2 Text Line 6
9876 07:27.6 Person1 Text Line 7
9876 08:17.5 Person2 Text Line 8
9876 10:20.3 Person2 Text Line 9
I would like to transform the data into the following:
id timestamp line_by line_text
1234 02:54.3 Person1 Text Line 1
1234 03:47.0 Person2 Text Line 2Text Line 3
1234 05:46.2 Person1 Text Line 4Text Line 5
9876 06:44.5 Person2 Text Line 6
9876 07:27.6 Person1 Text Line 7
9876 10:20.3 Person2 Text Line 8Text Line 9
Disclosure: I have already asked the same question for pandas in Python. This is the point where I am stuck in both R and Python.
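For anyone who wants to reproduce this, the input table above could be built roughly as follows. This is only a sketch: the name df and the column types (integer id, character timestamp) are assumptions, not something the original data confirms.

```r
# Sample data from the question; the name `df` and the column types are assumed
df <- data.frame(
  id = rep(c(1234L, 9876L), times = c(5, 4)),
  timestamp = c("02:54.3", "03:23.8", "03:47.0", "04:46.8", "05:46.2",
                "06:44.5", "07:27.6", "08:17.5", "10:20.3"),
  line_by = c("Person1", "Person2", "Person2", "Person1", "Person1",
              "Person2", "Person1", "Person2", "Person2"),
  line_text = paste("Text Line", 1:9),
  stringsAsFactors = FALSE
)
print(df)
```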
Answer 0: (score: 1)
Try this:
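library(dplyr)
library(data.table)  # only needed for rleid()
df %>%
  # rleid() gives each run of consecutive rows by the same speaker its own number
  group_by(id, grp = rleid(line_by)) %>%
  # keep the last timestamp of each run and join its lines;
  # use collapse = "" to match the question's expected output exactly
  summarise(timestamp = last(timestamp),
            line_by = unique(line_by),
            line_text = paste(line_text, collapse = ", ")) %>%
  select(-grp)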
This groups by id in addition to rleid(line_by), so a run never spans two conversations.
Output:
# A tibble: 6 x 4
# Groups: id [2]
# id timestamp line_by line_text
# <int> <chr> <chr> <chr>
# 1 1234 02:54.3 Person1 TextLine1
# 2 1234 03:47.0 Person2 TextLine2, TextLine3
# 3 1234 05:46.2 Person1 TextLine4, TextLine5
# 4 9876 06:44.5 Person2 TextLine6
# 5 9876 07:27.6 Person1 TextLine7
# 6 9876 10:20.3 Person2 TextLine8, TextLine9
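To see what rleid() contributes and why id is still needed as a grouping key, here is a small sketch. The five-row vectors are hypothetical, chosen so that one chat ends and the next begins with the same speaker; that does not happen in the question's sample.

```r
library(data.table)  # provides rleid()

# Hypothetical data: chat 1 ends with Person1 and chat 2 starts with Person1
id      <- c(1, 1, 1, 2, 2)
line_by <- c("Person1", "Person2", "Person1", "Person1", "Person2")

# Run ids based on the speaker alone: rows 3 and 4 land in the same run
run <- rleid(line_by)
print(run)

# Passing id as well starts a new run whenever either value changes
run_by_chat <- rleid(id, line_by)
print(run_by_chat)
```

Grouping by id and rleid(line_by) together, as in the answer, separates rows 3 and 4 in the same way that rleid(id, line_by) does.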
Answer 1: (score: 1)
A variant that uses only dplyr:
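library(dplyr)
df %>%
  # grp goes up by 1 whenever the speaker differs from the previous row;
  # it is listed before line_by so the result stays in conversation order
  group_by(id, grp = cumsum(line_by != lag(line_by, default = "")), line_by) %>%
  summarise(timestamp = last(timestamp),
            line_text = paste(line_text, collapse = "")) %>%
  ungroup() %>%
  select(id, timestamp, line_by, line_text)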
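The cumsum/lag idiom can be checked in isolation. The sketch below uses base R only; prev is a hypothetical helper standing in for dplyr's lag() with an empty-string default.

```r
# Speaker column from the question's sample data
line_by <- c("Person1", "Person2", "Person2", "Person1", "Person1",
             "Person2", "Person1", "Person2", "Person2")

# Previous speaker for each row; "" stands in for "no previous row"
prev <- c("", head(line_by, -1))

# TRUE wherever the speaker changes; the running total labels each run
grp <- cumsum(line_by != prev)
print(grp)  # 1 2 2 3 3 4 5 6 6
```

Rows sharing a grp value belong to one uninterrupted turn, which is the same labelling rleid(line_by) would produce.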