Question

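Similar to the question here, I want to use regex in R to extract character sequences from a string. I want to extract sections from a text document and end up with a data frame in which each subsection is treated as its own vector for further text mining. Here is my sample data: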

chapter_one <- c("One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.
1 Introduction
He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections. 
1.1 Futher
The bedding was hardly able to cover it and seemed ready to slide off any moment. 
1.1.1 This Should be Part of One Point One
His many legs, pitifully thin compared with the size of the rest of him, waved about helplessly as he looked.
1.2 Futher Fuhter
'What's happened to me?' he thought. It wasn't a dream. His room, a proper human room although a little too small, lay peacefully between its four familiar walls.")

This is my expected output:

chapter_id <- (c("1 Introduction", "1.1 Futher", "1.2 Futher Futher")) 
text <- (c("He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections.", "The bedding was hardly able to cover it and seemed ready to slide off any moment. His many legs, pitifully thin compared with the size of the rest of him, waved about helplessly as he looked.", "'What's happened to me?' he thought. It wasn't a dream. His room, a proper human room although a little too small, lay peacefully between its four familiar walls."))

chapter_one_df <- data.frame(chapter_id, text)

This is what I have tried so far:

library(stringr)

regex_chapter_heading <- regex("
          [:digit:]     # Digit number 
                        # MISSING: Optional dot and optional second digit number 
          \\s           # Space
          ([[:alpha:]]) # Alphabetic characters (MISSING: can also contain punctuation, as in 'Introduction - A short introduction')
                     ", comments = TRUE)

read.table(text=gsub(regex_chapter_heading,"\\1:",chapter_one),sep=":")

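So far this does not produce the expected output because, as noted in the comments, parts of the regular expression are still missing. Any help is much appreciated!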

Answer 1

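You can try the following approach: 1) replace all lines that start with three or more dot-separated numbers (since these are continuations of the previous item), and 2) extract the parts using a digit + optional dot + digit pattern as the delimiter, capturing the heading line and the lines that follow it into separate capturing groups: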

library(stringr)
# Replace lines starting with N.N.N+ with space
chapter_one <- gsub("\\R\\d+(?:\\.\\d+){2,}\\s+[A-Z].*\\R?", " ", chapter_one, perl=TRUE)
# Split into IDs and Texts
data <- str_match_all(chapter_one, "(?sm)^(\\d+(?:\\.\\d+)?\\s+[A-Z][^\r\n]*)\\R(.*?)(?=\\R\\d+(?:\\.\\d+)?\\s+[A-Z]|\\z)")
# Get the chapter ID column
chapter_id <- trimws(data[[1]][,2])
# Get the text column
text <- trimws(data[[1]][,3])
# Create the target DF
chapter_one_df <- data.frame(chapter_id, text)

Output:

         chapter_id
1    1 Introduction
2        1.1 Futher
3 1.2 Futher Fuhter
                                                                                                                                                                                              text
1                                       He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections.
2 The bedding was hardly able to cover it and seemed ready to slide off any moment.  His many legs, pitifully thin compared with the size of the rest of him, waved about helplessly as he looked.
3                               'What's happened to me?' he thought. It wasn't a dream. His room, a proper human room although a little too small, lay peacefully between its four familiar walls.
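If a single large regex feels hard to maintain, a line-based variant may be easier to follow. This is only a sketch of an alternative, not the method used above: it assumes every heading sits on its own line, and the toy input and the names x, txt_lines, is_heading, is_deep, section and keep are made up for illustration.

```r
# Hypothetical toy input: preamble, three kept headings, one N.N.N heading
x <- "Preamble.\n1 Intro\nAlpha.\n1.1 More\nBeta.\n1.1.1 Deep Heading\nGamma.\n1.2 Last\nDelta."

# One element per line, with surrounding whitespace removed
txt_lines <- trimws(strsplit(x, "\n", fixed = TRUE)[[1]])

# N or N.N headings start a new section; N.N.N+ headings are dropped
is_heading <- grepl("^\\d+(\\.\\d+)?\\s+[A-Z]", txt_lines)
is_deep    <- grepl("^\\d+(\\.\\d+){2,}\\s+[A-Z]", txt_lines)

# Running section number; 0 marks text before the first heading
section <- cumsum(is_heading)
keep    <- section > 0 & !is_heading & !is_deep

# Paste the body lines of each section together with a space
groups <- split(txt_lines[keep],
                factor(section[keep], levels = seq_len(sum(is_heading))))
text <- unname(vapply(groups, paste, character(1), collapse = " "))

chapter_id <- txt_lines[is_heading]
toy_df <- data.frame(chapter_id, text)
```

Because the factor levels cover every heading, a section without body lines ends up with an empty string rather than shifting the rows.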

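The \R\d+(?:\.\d+){2,}\s+[A-Z].*\R? pattern is used to replace the lines you want to "exclude" with a space:

\R - a line break sequence
\d+ - 1+ digits
(?:\.\d+){2,} - two or more repetitions of a . followed by 1+ digits
\s+ - 1+ whitespace characters (replace with \h to match a single horizontal whitespace, or with \h+ to match 1 or more)
[A-Z] - an uppercase letter
.* - any 0+ characters other than line break characters, as many as possible, up to the end of the line
\R? - an optional line break sequence.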
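To see the effect of this first pattern in isolation, here is a small sketch on a made-up string (the input x and the name cleaned are assumptions for illustration only). Note that perl = TRUE is needed, since \R is not supported by R's default regex engine.

```r
# Hypothetical toy input with one three-level heading in the middle
x <- "1.1 Further\nFirst sentence.\n1.1.1 Sub Heading\nSecond sentence.\n1.2 Next\nThird sentence."

# Same pattern as in the answer: the N.N.N heading line, together with the
# line breaks around it, is replaced by a single space
cleaned <- gsub("\\R\\d+(?:\\.\\d+){2,}\\s+[A-Z].*\\R?", " ", x, perl = TRUE)

cat(cleaned)
```

The "1.1.1 Sub Heading" line disappears and "Second sentence." is joined to the preceding text, while the "1.1" and "1.2" headings stay untouched because they have only one dot-separated part after the first number.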

The second regex is rather complex:

(?sm)^(\d+(?:\.\d+)?\s+[A-Z][^\r\n]*)\R(.*?)(?=\R\d+(?:\.\d+)?\s+[A-Z]|\z)

See the regex demo.

Details

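(?sm) - s makes . match any character, including line breaks, and m makes ^ match at the start of each line
^ - start of a line
(\d+(?:\.\d+)?\s+[A-Z][^\r\n]*) - Group 1: one or more digits, then 1 or 0 repetitions of . and 1+ digits, 1+ whitespaces, an uppercase letter, then any 0+ characters other than CR and LF, as many as possible
\R - a line break sequence
(.*?) - Group 2: any 0+ characters, as few as possible, up to the first occurrence of either of the following:
- \R\d+(?:\.\d+)?\s+[A-Z] - a line break, one or more digits, then 1 or 0 repetitions of . and 1+ digits, 1+ whitespaces, an uppercase letter
- | - or
- \z - the very end of the string.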
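If you want to check how the two groups behave without loading stringr, the same pattern can be tried with base R, which should treat it the same way when perl = TRUE is set. This is a sketch under that assumption; the toy input and the names x, pattern, sections and parts are made up for illustration.

```r
# Hypothetical toy input: preamble, then two sections
x <- "Preamble text.\n1 Intro\nAlpha.\n1.1 More\nBeta.\nGamma."

pattern <- "(?sm)^(\\d+(?:\\.\\d+)?\\s+[A-Z][^\r\n]*)\\R(.*?)(?=\\R\\d+(?:\\.\\d+)?\\s+[A-Z]|\\z)"

# Full matches: one string per section (heading line plus body)
sections <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

# Re-apply the pattern to each section to split it into its capture groups;
# each element of parts is c(full match, group 1, group 2)
parts <- regmatches(sections, regexec(pattern, sections, perl = TRUE))

chapter_id <- vapply(parts, function(p) p[2], character(1), USE.NAMES = FALSE)
text       <- vapply(parts, function(p) p[3], character(1), USE.NAMES = FALSE)
```

The preamble is skipped because it does not start with a number, the first body stops right before the line break that precedes "1.1 More", and the last body runs to the end of the string via \z.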
