Question

Hello, I've run into a problem. My text file looks something like this:

Section 1 Blah blah blah
Random sentence.
Section 2 Blah blah blah
Random sentence.
Section 564 of the blah blah blah.
Section 578 of the blah blah blah had
the following requirements.

I am trying to get:

Section 1 Blah blah blah
Section 2 Blah blah blah

However, I get:

Section 1 Blah blah blah
Section 2 Blah blah blah
Section 564 of the blah blah blah.
Section 578 of the blah blah blah had

The code I have is:

grep("(^(\\w+)\\s\\d+\\s+)",file, value=TRUE)

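I am trying to match any word followed by a number of any length, so in this case Section followed by any number, then some whitespace and the rest of the line. However, if the line is a complete sentence ending in a period, I don't want it. I'm not sure how to do that.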

Answer 1

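You can use a negated character class in a regular expression to match anything except the characters you list. Breaking down the pattern [^\\.]$:

[^ ] matches any single character other than those listed after the ^ inside the brackets
\\. is a literal ., escaped so that it does not mean "any character"
$ marks the end of the string.

So the pattern matches any string whose last character is anything other than a period. You can put other pattern elements in front of it if needed.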
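As a quick check of the pattern on its own, the sketch below applies it with base R's grepl to three lines taken from the question. The variable names lines, ends_without_period and kept are only illustrative.

```r
# Illustrative input: three lines taken from the question.
lines <- c(
  "Section 1 Blah blah blah",
  "Random sentence.",
  "Section 578 of the blah blah blah had"
)

# TRUE where the last character is anything other than a literal period.
ends_without_period <- grepl("[^\\.]$", lines)
print(ends_without_period)

# Keep only the lines that do not end with a period.
kept <- lines[ends_without_period]
print(kept)
```

Note that the line ending in "had" still passes this test, because it does not end with a period. The pattern alone is therefore not enough for sentences that wrap onto a second line.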

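Updated to handle lines that start with a lowercase letter. We can find which lines those are and then remove the line just before each of them (the one whose index is smaller by one). After that, we remove the lines that end with a period, as before.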

text = c(
  "Section 1 Blah blah blah",
  "Random sentence.",
  "Section 2 Blah blah blah",
  "Random sentence.",
  "Section 564 of the blah blah blah.",
  "Section 578 of the blah blah blah had",
  "the following requirements."
)

library(stringr)

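remove_sentences <- function(strings) {
  # Indices of lines that start with a lowercase letter (continuation lines).
  lower <- str_which(strings, "^[:lower:]")
  # Drop the line before each continuation line. setdiff() also copes with
  # the case where no line starts with a lowercase letter.
  keep <- setdiff(seq_along(strings), lower - 1)
  # Keep only the lines that do not end with a period.
  str_subset(strings[keep], "[^\\.]$")
}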

text %>%
  remove_sentences %>%
  writeLines
#> Section 1 Blah blah blah
#> Section 2 Blah blah blah

Created on 2018-06-29 by the reprex package (v0.2.0).

Answer 2

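You can extend your regular expression to match any characters up to the end of the line, while disallowing a literal . as the last character. Using the example from the original question: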

file <- c('Section 1 Blah blah blah',
'Random sentence.',
'Section 2 Blah blah blah',
'Random sentence.',
'Section 564 of the blah blah blah.')

grep("(^(\\w+)\\s\\d+\\s+.*[^\\.]$)",file, value=TRUE)
#> [1] "Section 1 Blah blah blah" "Section 2 Blah blah blah"
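The example above leaves out the last two lines of the question's input. The sketch below runs the same pattern on all seven lines and then, as one possible workaround, adds a check on the following line. The names file_full, matched, next_is_lower and result are assumptions made for this illustration, and the next-line check is not part of the answer as posted.

```r
# All seven lines from the question, including the wrapped sentence.
file_full <- c(
  "Section 1 Blah blah blah",
  "Random sentence.",
  "Section 2 Blah blah blah",
  "Random sentence.",
  "Section 564 of the blah blah blah.",
  "Section 578 of the blah blah blah had",
  "the following requirements."
)

pattern <- "(^(\\w+)\\s\\d+\\s+.*[^\\.]$)"

# The pattern alone still matches the line ending in "had".
matched <- grep(pattern, file_full, value = TRUE)
print(matched)

# Possible workaround: also require that the next line does not start
# with a lowercase letter. The last line has no successor, hence FALSE.
next_is_lower <- c(grepl("^[[:lower:]]", file_full)[-1], FALSE)
result <- file_full[grepl(pattern, file_full) & !next_is_lower]
print(result)
```

With the extra condition, only the two heading lines remain.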

Answer 3

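You can get the expected result by checking that the current line does not end with a . and that the next line does not start with a lowercase letter. One option is to use dplyr::lead to get the next line; another is to use tail(text, -1) from base R.

The solution is: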

text <- c(  
"Section 1 Blah blah blah",
"Random sentence.",
"Section 2 Blah blah blah",
"Random sentence.",
"Section 564 of the blah blah blah.",
"Section 578 of the blah blah blah had",
"the following requirements.")

# The code below selects lines that start with a capital letter, do not
# end with a ., and are not followed by a line starting with a lowercase letter.
text[grepl("^[A-Z].*[^.]$",text) & !c(tail(grepl("^[a-z].*",text),-1),FALSE)]

# [1] "Section 1 Blah blah blah"
# [2] "Section 2 Blah blah blah"

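Regular expressions used:

A. "^[A-Z].*[^.]$"

^[A-Z] - starts with an uppercase letter
.* - followed by any number of any characters
[^.]$ - does not end with a .

B. "^[a-z].*"

^[a-z] - starts with a lowercase letter
.* - followed by any number of any characters
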
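The dplyr::lead option is mentioned but not shown, so here is a minimal sketch of how it might look, assuming the dplyr package is installed. The intermediate names starts_lower, next_starts_lower and result are illustrative only.

```r
text <- c(
  "Section 1 Blah blah blah",
  "Random sentence.",
  "Section 2 Blah blah blah",
  "Random sentence.",
  "Section 564 of the blah blah blah.",
  "Section 578 of the blah blah blah had",
  "the following requirements."
)

# TRUE for lines that start with a lowercase letter.
starts_lower <- grepl("^[[:lower:]]", text)

# lead() shifts the vector so that position i holds the value for line i + 1.
# The last line has no successor, so it is filled with the default FALSE.
next_starts_lower <- dplyr::lead(starts_lower, default = FALSE)

# Same selection as the base R version: starts with a capital letter, does
# not end with a period, and the next line does not start in lowercase.
result <- text[grepl("^[A-Z].*[^.]$", text) & !next_starts_lower]
print(result)
```

This should select the same two lines as the tail(text, -1) version. The choice between the two is mostly a matter of whether a dplyr dependency is acceptable.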
Extract only the lines that do not end with a period