Splitting a paragraph into sentences in R

Asked: 2016-12-06 02:25:22

Tags: r regex

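I am trying to split a paragraph into sentences, but it isn't working. I feel like this should be easy and that I must be making a silly mistake. A plain string split works, but I would like to figure out how to do it with a regular expression.

Example: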

lorem <- "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."

strsplit(lorem, "[.]") 

[1] "Lorem Ipsum is simply dummy text of the printing and typesetting industry"                                                                                                                                      
[2] " Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book"                                     
[3] " It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged"                                                                                       
[4] " It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum"

But when I use a regular expression:

grep("[^\\.\\!\\?]*[\\.\\!\\?]", lorem, value=TRUE, perl=TRUE )

[1] "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."

It just returns the original input.

2 Answers:

Answer 0: (score: 1)

We can use the qdap package:

library(qdap)
sent_detect(lorem)

Output:

  

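[1] "Lorem Ipsum is simply dummy text of the printing and typesetting industry."
[2] "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book."
[3] "It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged."
[4] "It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."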

Answer 1: (score: 0)

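As @akrun said, grep only checks whether the pattern occurs somewhere in the string. With value=TRUE it returns every element that contains a match, in full, not the matched substrings. Since lorem is a single string that contains a match, you get the whole string back.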
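To illustrate the difference between selecting elements that match and extracting the matches themselves, here is a minimal base R sketch. The short input string `x` and the simplified pattern are made up for this illustration and are not taken from the question.

```r
# Hypothetical short input, used only for illustration
x <- "One. Two! Three?"
pattern <- "[^.!?]*[.!?]"

# grep() returns the whole element because it contains a match
whole <- grep(pattern, x, value = TRUE, perl = TRUE)

# gregexpr() locates every match and regmatches() extracts them
pieces <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

# Each piece after the first keeps its leading space, so trim it
sentences <- trimws(pieces)
print(sentences)
```

The same pair of calls should work on lorem as well, although I have only reasoned through the short string here.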

To do what you want, we can use str_match_all from the stringr package, which extracts every match (and any capture groups) from the string.

unlist( stringr::str_match_all(lorem, "[^\\s][^\\.\\!\\?]+[\\.\\!\\?]{1}") )

Output:

  

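[1] "Lorem Ipsum is simply dummy text of the printing and typesetting industry."
[2] "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book."
[3] "It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged."
[4] "It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."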
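If you would rather stay with strsplit and still keep the closing punctuation, one possible alternative is to split on the whitespace that follows a sentence terminator, using a Perl-style lookbehind. This is only a sketch. The sample string `x` is made up, and the pattern will also split after abbreviations such as "Dr.", so it is not a general sentence tokenizer.

```r
# Hypothetical short input, used only for illustration
x <- "One. Two! Three?"

# Split on whitespace that is preceded by ., ! or ?
# The lookbehind (?<=...) checks the punctuation without consuming it,
# so each sentence keeps its terminator
parts <- strsplit(x, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
print(parts)
```

Replacing `x` with lorem should give the four sentences without leading spaces.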