I am trying to extract strings that follow the same pattern from this text:
The Tragedy of Romeo and Juliet by William Shakespeare
library(readr)
txt <- read_file('http://www.gutenberg.org/cache/epub/1112/pg1112.txt')
Sample of the text:
Scene I.\r\nVerona. A public place.\r\n\r\nEnter Sampson and Gregory (with swords and bucklers) of the house\r\nof Capulet...
Scene II.\r\nA Street.\r\n\r\nEnter Capulet, County Paris, and [Servant] -the Clown.\r\n\r\n  Cap.
I want to extract
Verona. A public place.
A Street.
I tried
library(stringr)
str_extract(txt, "Scene\\s[IV]+\\.\\s\\s\\b[A-Z]+\\b")
It doesn't work.
Thanks in advance for any suggestions.
Answer 0 (score: 1)
str_extract_all(gsub("(Scene.*?)\r\n","\\1 ",txt),"Scene.*")
[[1]]
[1] "Scene I. Verona. A public place."
[2] "Scene II. A Street."
[3] "Scene III. Capulet's house."
[4] "Scene IV. A street."
[5] "Scene V. Capulet's house."
[6] "Scene I. A lane by the wall of Capulet's orchard."
[7] "Scene II. Capulet's orchard."
[8] "Scene III. Friar Laurence's cell."
[9] "Scene IV. A street."
[10] "Scene V. Capulet's orchard."
[11] "Scene VI. Friar Laurence's cell."
[12] "Scene I. A public place."
[13] "Scene II. Capulet's orchard."
[14] "Scene III. Friar Laurence's cell."
[15] "Scene IV. Capulet's house"
[16] "Scene V. Capulet's orchard."
[17] "Scene I. Friar Laurence's cell."
[18] "Scene II. Capulet's house."
[19] "Scene III. Juliet's chamber."
[20] "Scene IV. Capulet's house."
[21] "Scene V. Juliet's chamber."
[22] "Scene I. Mantua. A street."
[23] "Scene II. Verona. Friar Laurence's cell."
[24] "Scene III. Verona. A churchyard; in it the monument of the Capulets."
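The call above returns each heading together with its location ("Scene I. Verona. A public place."), while the question asks for the location alone. The following is a minimal sketch of one way to get there, not a definitive solution. It runs on a small inline sample rather than the downloaded file, and it assumes the heading and the location always sit on consecutive lines separated by \r\n. The names sample_txt, joined, headings, locations, and locations_alt are illustrative and do not come from the original post.

```r
library(stringr)

# Inline sample that mimics the layout shown in the question
# (assumption: "Scene <numeral>." is followed directly by \r\n and the location).
sample_txt <- paste0(
  "Scene I.\r\nVerona. A public place.\r\n\r\nEnter Sampson and Gregory\r\n",
  "Scene II.\r\nA Street.\r\n\r\nEnter Capulet\r\n"
)

# Step 1 (from the answer): replace the line break after each heading with a
# space, so the heading and its location end up on one line.
joined <- gsub("(Scene.*?)\r\n", "\\1 ", sample_txt)

# Step 2 (from the answer): extract from "Scene" to the end of that line.
headings <- str_extract_all(joined, "Scene.*")[[1]]

# Step 3 (added here): strip the "Scene <roman numeral>. " prefix.
locations <- str_replace(headings, "^Scene\\s+[IVX]+\\.\\s+", "")
print(locations)

# Alternative sketch: capture the location directly with a group.
# str_match_all() returns a matrix per input; column 2 holds the first group.
locations_alt <- str_match_all(sample_txt, "Scene\\s[IVX]+\\.\\s+(.*)")[[1]][, 2]
print(locations_alt)
```

On the sample, both variants should give the two strings the question asks for. The original attempt differs in two ways: [A-Z]+ between word boundaries only matches words written entirely in capitals, so it cannot match "Verona", and str_extract returns only the first match where str_extract_all returns every one.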