How can I extract sentences containing a specific word from text in RStudio?

Asked: 2017-12-05 05:30:13

Tags: r

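I want to extract the sentences that contain a specific word from a text file made up of several paragraphs.

For example: Digital India is an initiative by the Government of India to ensure that Government services are made available to citizens electronically by improving online infrastructure and by increasing Internet connectivity. It was launched on 1 July 2015 by Prime Minister Narendra Modi.

From this paragraph I need to extract every sentence that contains the word "India".

I tried the substr and substring commands in R, but they did not help. Could someone please help me with this?

Thanks in advance.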

2 Answers:

Answer 0: (score: 3)

You can use grep like this:

# Input paragraph
text <- c("Digital India is an initiative by the Government of India to ensure that Government services are made available to citizens electronically by improving online infrastructure and by increasing Internet connectivity. It was launched on 1 July 2015 by Prime Minister Narendra Modi.")
# Split on literal periods (the periods themselves are dropped)
text <- unlist(strsplit(text, "\\."))

# Keep the pieces that contain "India", ignoring case
text[grep(pattern = "India", text, ignore.case = TRUE)]

[1] "Digital India is an initiative by the Government of India ...
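Splitting on a bare period drops the punctuation, and a case-insensitive match on "India" also accepts words such as "Indian". The following is a minimal sketch of one possible variation, not part of the original answer. It assumes that sentences end with ".", "!" or "?" followed by whitespace, and it uses a shortened sample text with an added "Indian" sentence purely for illustration.

```r
# Shortened sample text (illustrative; the last sentence is an added assumption)
text <- "Digital India is an initiative by the Government of India. It was launched on 1 July 2015 by Prime Minister Narendra Modi. Indian summer is a weather phenomenon."

# Split on whitespace that follows sentence-ending punctuation,
# so each sentence keeps its closing ".", "!" or "?"
sentences <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))

# \\b marks a word boundary, so only the whole word "India" matches
hits <- grep("\\bIndia\\b", sentences, value = TRUE, perl = TRUE)
print(hits)
```

With this sample text, only the first sentence should be returned, because "Indian" does not match the whole-word pattern. The splitting rule is a heuristic and will break on abbreviations such as "e.g." or on decimal numbers followed by a space.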

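Answer 1: (score: 1)

Using regular expressions together with grep (or, for that matter, most likely any pattern-matching function in R) gives finer control over what is extracted from a given input string. That said, base-R regmatches (combined with gregexpr, which returns every match, whereas regexpr returns only the first) or str_extract_all from stringr can accomplish your particular task without requiring you to split the input vector beforehand.

For example, extracting every sentence that contains the word "India" can be done with the following expression. Note that I added another sentence containing 'India' in a derived form for illustration purposes.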

text <- "Digital India is an initiative by the Government of India ensuring that Government services are made available to citizens electronically by improving online infrastructure and by increasing Internet connectivity. It was launched on 1 July 2015 by Prime Minister Narendra Modi."
# Append a sentence that contains the derived form "Indian"
text <- paste(text, "Indian summer is a periodically recurring weather phenomenon in Central Europe.")

library(stringr)
# Optional leading words, then "India", then letters, digits, or whitespace up to the next period
str_extract_all(text, "([:alnum:]+\\s)*India[[:alnum:]\\s]*\\.")[[1]]

[1] "Digital India is an initiative by the Government of India ensuring that Government services are made available to citizens electronically by improving online infrastructure and by increasing Internet connectivity."
[2] "Indian summer is a periodically recurring weather phenomenon in Central Europe."

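There are many excellent tutorials on regular expressions online, so I will spare you the details here. To decipher the statement above, Regular Expressions in R might be a good starting point.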
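The base-R route mentioned at the start of this answer is not shown in code, so here is a rough sketch of how regmatches might be combined with gregexpr. The pattern and the shortened sample text are illustrative assumptions rather than the answer's own code. The pattern treats a sentence as a run of non-period characters ending at the next period.

```r
# Shortened sample text (illustrative assumption)
text <- paste(
  "Digital India is an initiative by the Government of India.",
  "It was launched on 1 July 2015 by Prime Minister Narendra Modi.",
  "Indian summer is a weather phenomenon in Central Europe."
)

# A run of non-period characters that contains "India" and ends at a period
pattern <- "[^.]*India[^.]*\\."

# gregexpr() locates every match; regmatches() extracts the matched text
matches <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]

# Remove the whitespace left over from the preceding sentence boundary
matches <- trimws(matches)
print(matches)
```

Like the stringr version, this matches "Indian" as well as "India". It relies on periods appearing only at sentence ends, so abbreviations or decimal numbers would need a more careful pattern.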