I have a text file similar to this:
Section A - Blah blah
Random sentence.
Section B - Hello
Random sentence.
SECTION C - Random sentence
Random sentence.
SECTION D - Hi
Part A - Hey
PART B - howdy
Task 1: Blah
Task 2: Blah
I am trying to get:
Section A Blah blah
Random sentence.
Section B Hello
Random sentence.
SECTION C Random sentence
Random sentence.
SECTION D Hi
Part A Hey
PART B howdy
Task 1 Blah
Task 2 Blah
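I am trying to detect patterns in the text, such as "Section" (case-insensitive) followed by a letter, or "Task" followed by a number, and remove the punctuation from those lines. I would like to know how best to do this.
Answer 0 (score: 4)
EDIT: Adding a solution with more checks in it.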
library(magrittr)  # provides the %>% pipe

# Keep the sample as a single multi-line string.
fd <- "Section A - Blah blah
Random sentence.
Section B - Hello
Random sentence.
SECTION C - Random sentence
Random sentence.
SECTION D - Hi
Part A - Hey
PART B - howdy
Task 1: Blah
Task 2: Blah"

fd %>%
  # "Section"/"Part" + letter, any case: drop the " -" that follows
  gsub("((section|part) [a-z]) -", "\\1", ., ignore.case = TRUE) %>%
  # "Task" + number: drop the ":" that follows
  gsub("(task [0-9]+):", "\\1", ., ignore.case = TRUE)
The output is as follows.
[1] "Section A Blah blah\nRandom sentence.\nSection B Hello\nRandom sentence.\nSECTION C Random sentence\nRandom sentence.\nSECTION D Hi\nPart A Hey\nPART B howdy\nTask 1 Blah\nTask 2 Blah"
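If the text is processed one line at a time (for example after reading a file with readLines), a line-wise variant of the same idea might look like the sketch below. It is not part of the original answer: the helper name strip_heading_punct, the combined regular expression, and the shortened sample vector are assumptions made for illustration.

```r
# Hypothetical helper (not from the original answer): removes the "-" or ":"
# that directly follows a "Section X" / "Part X" / "Task N" heading.
strip_heading_punct <- function(lines) {
  # Group 1 captures the heading: keyword + one letter, or "task" + a number.
  # The pattern is anchored at the start of the line, so other lines are untouched.
  pattern <- "^((section|part)\\s+[a-z]|task\\s+[0-9]+)\\s*[-:]"
  sub(pattern, "\\1", lines, ignore.case = TRUE)
}

# Shortened sample; each element stands for one line of the file.
lines <- c("Section A - Blah blah", "Random sentence.", "SECTION D - Hi",
           "PART B - howdy", "Task 2: Blah")

cleaned <- strip_heading_punct(lines)
writeLines(cleaned)
```

Because the pattern is anchored with ^ and uses sub rather than gsub, at most one separator per line is removed, and lines that do not start with one of the keywords pass through unchanged.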
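The following may also help. It removes every hyphen and colon in the text, whatever precedes them, and leaves the surrounding spaces in place.
gsub("-|:", "", var)
Here is the sample data for var; define it before running the call above.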
var <- c("Section A - Blah blah
Random sentence.
Section B - Hello
Random sentence.
SECTION C - Random sentence
Random sentence.
SECTION D - Hi
Part A - Hey
PART B - howdy
Task 1: Blah
Task 2: Blah")
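To make the trade-off concrete, the sketch below contrasts a blanket removal of "-" and ":" with a removal anchored to the heading keywords. The vector x, including its second element, is made-up sample data, and the anchored pattern is one possible choice rather than the answer's own.

```r
# Made-up sample: the second element is an ordinary sentence that happens to
# contain a hyphen and a colon.
x <- c("Section A - Blah blah", "A well-known sentence: really.", "Task 1: Blah")

# Blanket removal: every "-" and ":" disappears, even inside ordinary
# sentences, and the spaces around a removed " - " remain.
blanket <- gsub("-|:", "", x)

# Anchored removal: only the separator right after a heading keyword is dropped.
targeted <- sub("^((section|part) [a-z]|task [0-9]+) ?[-:]", "\\1", x,
                ignore.case = TRUE)

writeLines(blanket)
writeLines(targeted)
```

The blanket version turns "well-known" into "wellknown" and leaves two spaces after "Section A", while the anchored version changes only the two heading lines.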