Question

Here is my article data.


inp <- Sentence1+sentence2+.......+ LAST SENTENCE OF THE ARTICLE+A version of this article appears in print on 08/05/2015, on page C3 of the....

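I want to do two things.

First, I want to get rid of everything from "A version of this article appears in print" onward.

Second, I want to extract "C3" from the sentence "A version of this article appears in print on 08/05/2015, on page C3".

I tried to do this with the str_replace_all function, but I couldn't get it to work.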

Answer 1

Test case:

art <- "Sentence1+sentence2+.......+ LAST SENTENCE OF THE ARTICLE+ A version of this article appears in print on 08/05/2015, on page C3 of the Archive copy. The archive can be fouund here, blah, blah. And more blah, blah, blah."

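First, remove the unwanted material up to the page reference (including the space after "page"). We assume every article has its date in dd/nn/YYYY format (two digits, two digits, four digits). A second gsub call then drops everything from the first remaining space onward: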

> pgref <- gsub("^.+appears\\ in\\ print\\ on\\ \\d{2}/\\d{2}/\\d{4}.+page\\ ", "", art)
> pgref
[1] "C3 of the Archive copy. The archive can be fouund here, blah, blah. And more blah, blah, blah."
> pgref <- gsub("\\ .+$", "", pgref)
> pgref
[1] "C3"
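The two gsub calls above can also be folded into a single call with a capture group. This is only a sketch of an alternative, not the answer's own method: the name pgref2 is made up for illustration, and perl = TRUE is used so that the lazy quantifier stops at the first "page " after the date.

```r
art <- "Sentence1+sentence2+.......+ LAST SENTENCE OF THE ARTICLE+ A version of this article appears in print on 08/05/2015, on page C3 of the Archive copy. The archive can be fouund here, blah, blah. And more blah, blah, blah."

# Match the whole string, but keep only the run of non-space characters
# that follows the first "page " after the print date (capture group 1).
# pgref2 is a hypothetical name, chosen to avoid clobbering pgref.
pgref2 <- sub(
  "^.*appears in print on \\d{2}/\\d{2}/\\d{4}.*?page (\\S+).*$",
  "\\1",
  art,
  perl = TRUE
)
pgref2
```

Note that sub returns its input unchanged when the pattern does not match, so an article without the print notice would come back whole rather than as a page reference.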

Then go on to remove the trailing material:

> trimart <- gsub("A version of this article\\ appears\\ in\\ print\\ on\\ \\d{2}/\\d{2}/\\d{4}.+$", "", art)
> trimart
[1] "Sentence1+sentence2+.......+ LAST SENTENCE OF THE ARTICLE+ "
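If there are many articles, both steps could be wrapped in one helper that works on a character vector. The following is a rough sketch under stated assumptions: split_print_notice, arts, and res are hypothetical names, the sample articles are invented for illustration, and the pattern assumes the notice always reads ", on page " immediately after the date, as in the test case.

```r
# Hypothetical helper: returns the article body (notice removed, whitespace
# trimmed) and the page reference, or NA where no print notice is found.
split_print_notice <- function(x) {
  # Assumed notice layout: "... in print on dd/nn/YYYY, on page <ref> ..."
  notice <- "A version of this article appears in print on \\d{2}/\\d{2}/\\d{4}, on page (\\S+).*$"
  found <- grepl(notice, x, perl = TRUE)
  # Replace the whole string with capture group 1 where the notice exists
  page <- ifelse(found,
                 sub(paste0("^.*", notice), "\\1", x, perl = TRUE),
                 NA_character_)
  # Drop the notice and everything after it, then trim stray whitespace
  body <- trimws(sub(notice, "", x, perl = TRUE))
  data.frame(body = body, page = page, stringsAsFactors = FALSE)
}

# Invented sample input, one article with a notice and one without
arts <- c(
  "First article text. A version of this article appears in print on 08/05/2015, on page C3 of the Archive copy.",
  "Second article text with no print notice."
)
res <- split_print_notice(arts)
res
```

Returning NA for the page when the notice is missing avoids mistaking an unchanged article for a page reference.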

Pattern matching and string manipulation in R
