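I want to split some TV scripts into a data frame with two variables: (1) the spoken dialogue and (2) the speaker.
Here is the sample data: http://www.buffyworld.com/buffy/transcripts/127_tran.html
I load it into R as follows: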
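library(rvest)  # provides read_html(), html_text() and the %>% pipe
url <- 'http://www.buffyworld.com/buffy/transcripts/127_tran.html'
page <- read_html(url)          # parse the transcript page
all <- page %>% html_text()     # extract the page text as a single string
all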
[1] "Selfless - Buffy Episode 7x5 'Selfless' (#127) Transcript\n\nBuffy Episode #127: \"Selfless\" \n Transcript\nWritten by Drew Goddard\n Original Air Date: October 22, 2002 Skip Teaser.. Take Me To Beginning Of Episode. \n\n \n \n NB: The content of this transcript, including the characters \n and the story, belongs to Mutant Enemy. This transcript was created \n based on the broadcast episode.\n \n \n \n \n BUFFYWORLD.COM \n prefers that you direct link to this transcript rather than post \n it on your site, but you can post it on your site if you really \n want, as long as you keep everything intact, this includes the link \n to buffyworld.com and this writing. Please also keep the disclaimers \n intact.\n \n Originally transcribed for: http://www.buffyworld.com/.\n\t \n TEASER (RECAP SEGMENT):\n GILES (V.O.)\n\n Previousl... <truncated>
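I am now trying to split the text at each character's name (I have a complete list of them), for example 'GILES' above. The split itself works, but splitting there drops the character name. Here is a simplified example: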
to_parse <- paste(c('BUFFY', 'WILLOW'), collapse = '|')
all <- strsplit(all, to_parse)
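This gives me the split I want, but it does not keep the character names.
Narrow question: is there any way to keep the character name with what I am doing? Open-ended question: is there another approach I should try?
Thanks in advance!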
Answer 0 (score: 3)
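I think you can use a Perl-compatible regular expression with a lookbehind in strsplit, so that the name is matched but not consumed by the split. For illustration I use a shorter example string, but it should work the same way: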
string <- "text BUFFY more text WILLOW other text"
to_parse <- paste(c('BUFFY', 'WILLOW'), collapse = '|')
strsplit(string, paste0("(?<=", to_parse, ")"), perl = TRUE)
#[[1]]
#[1] "text BUFFY" " more text WILLOW" " other text"
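As @Lamia suggested, if the name comes before the text (as it does in the transcript), you can use a positive lookahead instead. I edited the suggestion slightly so that each split string keeps the delimiter, here at its start: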
strsplit(string, paste0("(?<=.(?=", to_parse, "))"), perl = TRUE)
#[[1]]
#[1] "text " "BUFFY more text " "WILLOW other text"
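To get from those pieces to the two-column data frame the question asks for, something along these lines might work. This is only a sketch on the short example string: the names `speakers`, `pieces` and `lines_df` are made up for illustration, and it assumes a character name never appears inside the spoken text itself, which may not hold for the real transcript.

```r
string <- "text BUFFY more text WILLOW other text"
speakers <- c("BUFFY", "WILLOW")              # illustrative list of names
pattern <- paste(speakers, collapse = "|")

# Split just before each name, using the lookahead trick shown above
pieces <- strsplit(string, paste0("(?<=.(?=", pattern, "))"), perl = TRUE)[[1]]

# Keep only the pieces that start with a name (drops any leading text)
name_rx <- paste0("^(", pattern, ")")
pieces <- pieces[grepl(name_rx, pieces)]

# The speaker is the leading name; the dialogue is whatever follows it
speaker <- regmatches(pieces, regexpr(name_rx, pieces))
dialogue <- trimws(sub(name_rx, "", pieces))

lines_df <- data.frame(speaker = speaker, dialogue = dialogue,
                       stringsAsFactors = FALSE)
lines_df
```

On the real page you would pass `all` instead of `string` and the full list of names instead of `speakers`. Stage directions and headings would still end up inside the dialogue column and would need separate cleaning.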