Split a txt file (conversation) into columns with the speakers' names as variables

Time: 2019-10-04 23:29:31

Tags: r

I am new to text mining in R. I have multiple txt files containing conversations between the same speakers, organized like this:

speaker one [speakers' names are on their own line]
what speaker one says [paragraph of each speaker's speech after 
line break from name]
[empty line]
speaker two
what speaker two says
[empty line]
speaker one
what speaker one replies
[empty line]
speaker three
what speaker three says
...

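I would like to turn the texts into a table with one row per text and one column per speaker, named after that speaker. Everything a speaker says in a given text should be combined into a single cell in that text's row, and likewise for the other speakers. Like this: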

text   "speaker one"                "speaker two"              ...
text1  everything speaker one said  everything speaker two said
text2  everything speaker one said  everything speaker two said
...

Any help getting started would be greatly appreciated.

1 answer:

Answer 0: (score: 0)

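You can get there with a few of the tidyverse packages. First read in the text with readr::read_file, then split it on the empty lines and read each piece into a data.frame with readr::read_delim. Because the data is now in a list, use bind_rows to collapse everything into one data.frame. bind_rows matches on column names, so each speaker's text ends up in the correct column. Which of the two results below you use depends on whether you want the first solution (one row per line of speech) or the second (all text collapsed into one row).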

I leave combining multiple text files up to you.

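The first code block below reads the file and combines the pieces with bind_rows (the first solution). The second code block collapses all text into one row:

This takes a bit more work: first gather the data into a tidy long format, collapse the text, and then spread it out wide again. Run the statements in chunks if you want to see what happens at each step.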

library(readr)
library(tidyr)
library(dplyr)

# read file into a character vector
text <- readr::read_file("conversation.txt")

# split the text on the empty line
# ("\r?\n\r?\n" matches both Windows (\r\n) and Unix (\n) line endings)
split_text <- strsplit(text, split = "\r?\n\r?\n")

# read the data in again with read_delim. This will generate a list of data.frames
list_text <- lapply(unlist(split_text), function(x) readr::read_delim(x, col_names = TRUE, delim = "\t"))

# use bind_rows from dplyr to combine everything into 1 tibble. bind_rows matches on the column names.
list_text %>% 
  bind_rows

# A tibble: 5 x 3
  `speaker one`                                                      `speaker two`         `speaker three`         
  <chr>                                                              <chr>                 <chr>                   
1 what speaker one says is in this paragraph.                        NA                    NA                      
2 It might be in multiple lines, but not seperated by an empty line. NA                    NA                      
3 NA                                                                 what speaker two says NA                      
4 what speaker one replies                                           NA                    NA                      
5 NA  

The output above comes from the sample text stored in the text file conversation.txt. Collapsing it into one row:

list_text %>% 
  bind_rows %>% 
  pivot_longer(everything(), 
               names_to = "speakers",
               values_to = "text",
               values_drop_na = TRUE) %>% 
  group_by(speakers) %>% 
  summarise(text = paste0(text, collapse = " ")) %>% 
  pivot_wider(names_from = speakers, values_from = text)

# A tibble: 1 x 3
  `speaker one`                                                                                   `speaker three`       `speaker two`     
  <chr>                                                                                           <chr>                 <chr>             
1 what speaker one says is in this paragraph. It might be in multiple lines, but not seperated b~ what speaker three s~ what speaker two ~
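
The answer leaves combining multiple files to the reader. One possible way to do that step is sketched below. It is not the answer's tidyverse pipeline: it uses base R only, so that it runs without extra packages. The function names parse_conversation and combine_conversations, the file names text1.txt and text2.txt, and the sample sentences are all made up for illustration. The sketch assumes each block starts with the speaker's name on its own line and that blocks are separated by empty lines, as described in the question.

```r
# Hypothetical helper: parse one conversation file into a named character
# vector with one element per speaker (all of that speaker's turns pasted).
parse_conversation <- function(path) {
  lines <- trimws(readLines(path, warn = FALSE))
  # Every empty line starts a new block (speaker name + speech)
  block_id <- cumsum(lines == "")
  keep <- lines != ""
  blocks <- split(lines[keep], block_id[keep])
  # First line of a block is the speaker, the rest is what they say
  speakers <- vapply(blocks, function(b) b[1], character(1))
  speech <- vapply(blocks, function(b) paste(b[-1], collapse = " "), character(1))
  # Collapse all turns of the same speaker into one string
  vapply(split(speech, speakers), paste, character(1), collapse = " ")
}

# Hypothetical helper: one row per file, one column per speaker.
combine_conversations <- function(paths) {
  parsed <- lapply(unname(paths), parse_conversation)
  all_speakers <- unique(unlist(lapply(parsed, names)))
  rows <- lapply(parsed, function(p) {
    out <- p[all_speakers]          # NA for speakers absent from this file
    names(out) <- all_speakers
    out
  })
  wide <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  # check.names = FALSE keeps column names such as "speaker one" intact
  data.frame(text = basename(paths), wide,
             check.names = FALSE, stringsAsFactors = FALSE)
}

# Small made-up example written to a temporary directory
dir <- tempfile("conversations")
dir.create(dir)
writeLines(c("speaker one", "hello there", "",
             "speaker two", "hi", "",
             "speaker one", "how are you?"),
           file.path(dir, "text1.txt"))
writeLines(c("speaker two", "good morning", "",
             "speaker three", "morning"),
           file.path(dir, "text2.txt"))

paths <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
result <- combine_conversations(paths)
print(result)
```

A speaker who does not appear in a given file gets NA in that row. The same shape could probably also be reached by running the answer's pipeline once per file and calling bind_rows on the results, which would keep everything in the tidyverse.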